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Digital Transmission Over Cross-Coupled Linear 
Channels 


By J. SALZ* 
(Manuscript received January 14, 1985) 


For a multiuser data communications system operating over a mutually 
cross-coupled linear channel with additive noise sources, we determine the 
following: (1) a linear cross-coupled receiver processor (filter) that yields the 
least-mean-squared error between the desired outputs and the actual outputs, 
and (2) a cross-coupled transmitting filter that optimally distributes the total 
available power among the different users, as well as the total available 
frequency spectrum. The structure of the optimizing filters is similar to the 
known 2 x 2 case encountered in problems associated with digital transmission 
over dually polarized radio channels. 


I. INTRODUCTION 


A variety of communication channels can be modeled as multi- 
input, multi-output, mutually cross-coupled linear networks with ad- 
ditive noise sources. A few examples of communications systems 
operating over such channels are dually polarized radio systems, 
frequency/time-division multiplexing with crosstalk, cordless PBXs, 
spread-spectrum multiuser systems, and multisensor radar/sonar sys- 
tems. In many applications it is beneficial to design cross-coupled 
transmitters and receivers that take advantage of the inherent mutual 
interferences. The chief purpose of this paper is to explore these issues 
from a theoretical point of view. 

The general problem we address follows. We consider an N input- 


* AT&T Bell Laboratories. 
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port, N output-port, linear transmission channel characterized by an 
N x N complex matrix frequency transfer function, C(w), where the 
entries in C(w), Cj(w), 1, 7 =1...N, represent the transfer character- 
istics from input 1 to output j. Digital data signals Dt), i=1---N 
are intended for simultaneous transmission over this medium. The 
general problem we address is as follows: how does one jointly optimize 
the 2N? entries of cross-coupled linear receiving and transmitting 
matrix filters when the performance criterion is total Mean-Squared 
Error (MSE) subject to a constraint on the total average transmitted 
power? The general setup is shown schematically in Fig. 1. 

This paper includes generalizations to N(N = 2) dimensions of 
earlier work dealing with digital data transmission over dually polar- 
ized radio channels.” As far as can be determined, these generaliza- 
tions have not been reported in the open literature. We found, however, 
two early papers** dealing with multi-input, multi-output, communi- 
cations that might be of interest to the reader in connection with our 
problem. For additive Gaussian noise sources the Shannon capacity 
of the type of channel considered here has also been determined.° 

After formulating the problem in the next section, we derive the 
optimal matrix receiving filter structure and provide a closed-form 
formula for the least MSE. In Section IV we address and solve the 
optimum transmitter problem, while the last section has our conclud- 
ing remarks. 


Il. PROBLEM FORMULATION 


Consider a linear communications medium characterized by N? (N 
is an arbitrary integer) real impulse response functions, 


fo 2) ; d ; 
hy(t) = { Hy(w)e™ oo lJ = iF Zee N, (1) 


where H;(w), 1, j = 1 --- N are the complex frequency transfer 
characteristics from input | to output j. The representation in (1) 
characterizes a linear medium with N inputs and N outputs where 








V4 (t) SAMPLE 
1 
















s TRANSMITTER CHANNEL RECEIVER 
INPUTS @ P(w) C(w) wiw) e OUTPUTS 
be N XWN MATRIX N XN MATRIX N XW MATRIX ° 


POWER 


> POWER = CONSTANT 


Fig. 1—Multi-input, multi-output data communications. 
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hj(t) is interpreted as the output of the jth medium when the /th 
input is a unit impulse. More compactly, let h(t) represent the N x N 
impulse response matrix with entries h(t). Now the diagonal entries 
stand for the direct-channel impulse responses, while the off-diagonal 
terms are the cross-interference impulse responses. 

In the present situation we have in mind that N real data signals, 


D(t)= Yavg(t-nT), l=1---N, (2) 


where {a‘?} are the data symbols, are intended for simultaneous 
transmission over the N channels. Mutual dispersive cross coupling, 
intersymbol interference, and noise distort these signals, and the 
purpose here is to devise transmitting and receiving processors to 
mitigate against these interferences. 

The present model also accommodates bandpass coherent data 
signals in which case (2) represents the baseband-equivalent signals 
with complex data symbols. In our general formulation, however, we 
can get by with only real data symbols since complex numbers are 
isomorphic to 2 x 2 real matrices.* 

Returning to the general formulation, the N data sequences are now 
represented by the column vectors 


fae 

A, = : fall n, 
ay 

and the data signal is thus represented by an N-dimensional vector, 


D(t) = ¥ Ang(t — nT). (3) 


Consequently, the channel output signal can be represented as the 
vector time function, 


S(t) = ¥ H(t — nT)A,, (4) 
where the matrix H(t) is the convolution 


H(t) = [- g(t — r)h(7)dz, (5) 


and the N x N matrix h(t) has entries defined in (1). 
For the sake of clarity, we restrict our treatment to N x N trans- 
mitter and receiver filters, while we recognize applications where it 


* References 1 and 2 exploit this isomorphism effectively in the treatment of quad- 
rature amplitude modulated systems over bandpass channels. 
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would be necessary to handle nonsquare matrices. This does not, 
however, restrict our approach, as will become evident. 

Proceeding with the analysis, we presume that a noise vector p(t) is 
added to (4) and the sum signal is passed through an N x N matrix 
receiver filter denoted by W(t). A representative sample of the output 
vector taken at ¢t = 0 (without loss of generality) yields 


So = UoAo + XY UnAn + v0, (6) 
n#¥0 


where 


U, = { W(-7)H(7 — nT)dr 


and 


| W (—1)v(7)dr. 


We now regard the vector So [a suitably quantized version of So in 
(6)] as an estimate of the data vector Ao. Thus the system designer 
has the freedom to choose the receiver matrix W(t) and a transmitting 
filter (yet unspecified) to make So as close as possible, in some 
reasonable sense, to the desired quantity. While the most objective 
sense in which this can be made close to Ao is 


probability [So ¥ Ao] <6 


for a given 6, it is not a mathematically tractable quantity to work 
with, and therefore a simpler cost function is sought. As is well known, 
the probability of error cannot be expressed exactly even when the 
added noise is assumed to be Gaussian—a difficulty caused by the 
presence of intersymbol and cross-channel interference. Here we em- 
ploy a simple cost-function, least-mean-squared error between Sp and 
Ao. This criterion, which lends itself to mathematical analysis, can 
also be used to upper bound the probability of error when the added 
noise is Gaussian. 

Returning to the mathematical problem at hand, we thus define the 
error vectors e as the difference between So, eq. (6), and the desired 
data vector Ao. The total Mean-Squared Error (MSE) then is 


MSE = E{e*e} = tr[E{ee"}], (7) 
where 
c= (Up = I)Ao + y U, An + Vo. (8) 
n¥0 
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In (7) tr stands for the trace of a matrix, ‘ the complex conjugate 
transpose, and #(-) the mathematical expectation with respect to all 
random variables. 

Without loss of generality, we assume that the data symbols as well 
as the vector sequences A,, are independent and identically distributed, 
as are the added noise components in p(t). With these assumptions 
the detailed evaluation of (7) becomes 
MSE = Mee = uf! —2 { W(-7)H(7r)dr7 


Od 





+ 2 | W(—1)W'(—1r)dr 


+ > il W(-7)H(7 — nar | Hi(7 - ntyWH nar (9) 


where 
E{A,A}} = oal, 
E{v(t)v*(t)} = Nol, 
and 


= 


S/F 


Recall that when the data symbols {a!?} take on values, +1 +3 + --- 
+(L—-1), Ef{a?} = of = L’? - 1/8. 

In the next section (9) is first minimized with respect to the class 
of all N x N real matrices for a given channel matrix H and parameter 
o”, and in Section IV it is further minimized with respect to the 
admissible class of transmitter filters. 


Ill. THE RECEIVER OPTIMIZATION PROBLEM 


It turns out that the mathematical machinery used in the selection 
of the optimum N x N, W(t), is the same as that for the 4 x 4 case 
developed in Ref. 2, so here we only briefly review the approach. 

Proceeding with the optimization problem, replace W(t) by 

(Wo) z (in)i, Lcd = 1, my N, (10) 
where n; are arbitrary functions of 7, and set 
me (MSE) = [0] 
Oke : 


at 6; =0,1,j7=1,---,N. 
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Now compute 


i (MSE) = tr |» il ni(7)H(7)dr + 207 { Wo(—1)nf(r)dz 


2). i W,(7)H(7 — nT)dr i Hi(7 - nah) = 0, 


n#¥0 
pa lye; (11) 


where the matrices, nf, i,j = 1, ---, N have the entry n;;(7) in the ith 
position and zero everywhere else. By a direct computation of the trace 
in eq. (11), one obtains 


a [ (H(r))jinj(7)d7 + 0? { (W(-7)) ing (7) dr 


ey il [H(r — nT)Ut]img()dt7 = 0, iG f=1,---,N. (12) 
n#0 


Since eq. (12) must hold for all functions, (7), we obtain the matrix 
integral equation that must be satisfied by the optimum matrix Wo(7), 


o’W)(-7) = H'(7) — & U,H'(r — nT), (18) 
ind 
where U, is given in eq. (6) with W(r) replaced by Wo(7). Structurally, 
W,(r7) consists of an N x N matrix-matched filter followed by a matrix- 
tapped delay line with matrix-tap coefficients U,. 
An explicit formula for the least MSE is now possible to obtain by 
first post-multiplying eq. (13) by W§(—7), integrating, and then com- 
paring the result with eq. (9). The result is 


MSE) = tr[I — Ub], (14) 


where Up is obtained by solving a set of infinite linear equations 
obtained by first post-multiplying eq. (13) by H(z — kT’) and then 
integrating. These operations yield the equations 


o°U, = R, = », U,,Re-n, all k 


n#¥0 


= RI, (15) 


where 


R, = i: H'(7)H(7 — kT)dr. (16) 
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The matrix convolutional equation, (15), is easy to solve by Fourier 
series methods. Thus, inserting the souuuien of Up, obtained from (15), 


into (14), we get 
MSE, = ie rie Bol du, (17) 





where 
R(w) = Y Rye 
k 


and 


R, = a i R (w)e "du, 
2a J-x/T 


= H'(7)H(7 — kT)dr (18) 


[from (16)]. 
Now the matrix frequency transfer characteristic is, by definition, 


H(w) = i H(t)e'‘dt, 


and, by Parseval’s theorem, (18) is put into the form 


R, =— { A (w)H(w)e** de 
Qn —00 
n/T 
_ti rit _ 2ak\ » _ 2tk\ isur 
= on ge x H (. T ) H (. T Je dw. (19) 


From this and (18) we determine 


R(w) =7>m (. - 224) fi (« - 222), (20) 


which is the matrix-folded, or aliased, channel spectrum. 

In the next section we further minimize MSBE, eq. (17), with respect 
to the transmitter matrix filter. 
IV. TRANSMITTER OPTIMIZATION 

The general optimization problem with the inclusion of the aliases, 


eq. (20), appears to be extremely complicated to solve,* and therefore 


* We refer the interested reader to Ref. 6, where a similar optimization problem is 
solved without having to make the bandlimited assumption. 
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we shall assume that the transmitter filter is strictly bandlimited to 
the Nyquist frequency, 7/T. So, without excess bandwidth, (20) re- 
duces to 


Ro) = 5 Ww), (21) 


and since the transmitter matrix filter is in cascade with the channel 
filter, we can write 


H(w) = C(w)P(w). (22) 


Equation (2) now becomes 
R(w) = = Pt(w)Ct(w)C(w) Pa), (23) 


where C(w) is the transmission medium frequency transfer character- 
istic and P(w) represents the N x N matrix transmitting filter fre- 


quency characteristics. 
Note that the average total transmitted power is proportional to 


a/T 
{ tr{P*(w) P(w)]de, 
—1/T 


and therefore the quantity we wish to minimize with respect to P(w) 
is, from (17) and (23), 


n/T 1 -1 
F= { tr E + oT PreicwowPr)| dw 


a/T 


a/T 
+ + tr[P*(w) P(w)]dw, (24) 


where 2d is a Lagrange multiplier to be determined from the power 
constraint. 

Since for each value of w, C'C is hermitian, nonnegative definite, it 
can be diagonalized by a unitary matrix y, 


C¥(w)C(w) = *(w)A(o)¥(o), (25) 


where A(w) is the diagonal matrix with entries, \; --- Ay, the eigen- 
values of C'C. By letting G = yP, Q = GG, and scaling the eigen- 
values by 1/o?T, (24) is put into the form 


u/T 
F= { bg tr{(I + QA) + AQ}dw, (26) 


where we have used an innocuous result, 
tr[I + AB]"? = tr[I + BA], 
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for any matrices A and B. (The spectrum of AB is the same as that 
of BA.) 
We now seek a minimum over all possible matrices Q such that 


x/T x/T 
f tr[I + QA]““dw = min i tr[I + QA] ‘dw (27) 
—x/T Q ~—x/T 


subject to [777 trQdw = constant. 
Since A is real and nonnegative we define 


M(w) = A”?(w)Q(w)A”?(w), (28) 
which is again diagonalized by unitary matrices U(w), 
M = Udig(6, --- dn) Ut, (29) 


where 6, -- - 6y are now the eigenvalues of M. Let the diagonal elements 
of M be d, --- dy. We then obtain from (29) 


N 
d:= ¥ bnlUn|*, i=1---N. (30) 
n=1 


This relationship is now used to prove that (27) is achieved when Q 
is diagonal. Since the integrand in (27) is positive, it suffices to 
minimize the integral point by point. 

Let Sz be the diagonal matrix formed from M by setting all off- 
diagonal entries to zero. From the definition of the trace we write 








nas 1 
tr{I + MJ]? = > i+, 
and 
as | 
tr[I a Sz = zx i+ a 


If diagonal Q renders the smallest trace, it must be shown that 
ar | ee ol 


» 


<= ° 
rer ia a erg 


The proof of (31) is facilitated by the observation from (29) and (30) 
that 








(31) 


/ 


|Uin| = 0, Ln=1.--N 


and 
N N 
Y |Unl? = XY |Unl? = 1. 
i=1 n=1 
These are precisely the necessary and sufficient conditions that 2N 
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numbers {d;, 6;} be related such that for any continuous convex function 
(x) one has 


P(d,;) + --- + B(dy) S B(6,;) + --- + B(dy). (32) 


(See Theorems 13 and 14 in Ref. 7, pp. 30-1.) Since only the diagonal 
entries in Q enter into the constraint, this result proves our assertion. 

During the course of this research I discussed this assertion with 
my colleague Hans Witsenhausen, who supplied an elegant proof using 
Schur transformations. With his permission I reproduce his note to 
me in the Appendix. 

Once we establish the fact that the optimum matrix Q is diagonal, 
we note from the definition that 


Q = GG‘ = YPPty* = diagonal, (33) 


and since its entries are nonnegative, the form of the optimum trans- 
mitter matrix is immediate, 


Po(w) = ¥1(w)Q’?(w)S(w), |w| < 
= [0], elsewhere, (34) 


where S(w) is an arbitrary unitary matrix. Thus, there are an infinite 
number of P(w)’s that yield minimum mean-squared error while yield- 
ing a fixed amount of total average transmitted power. 

The additional problem of determining the optimum values of the 
diagonal elements gq, in Q is now easy since (26) can now be written 
as 

N «/T 1 
e » —x/T ; + An(w) dn(w) 
which is easily minimized with respect to the positive quantities {q,(w)} 
using standard calculus of variation techniques, and the Lagrange 
multiplier is determined from the given power constraint. The min- 
imization of (35) follows the same procedure as in the N = 2 case, 
treated in detail in Ref. 1. 


+ rant dw, (35) 


V. CONCLUSIONS 


Fallouts from solving some very special problems associated with 
digital communications over dually polarized radio channels yielded 
general results applicable to a wide range of communications situa- 
tions. It appears that the class of minimization problems encountered 
in our formulations are well known to the mathematicians; if the 
engineers could only ask the right questions. 

From a mathematical point of view, the optimum overall system has 
an end-to-end equivalent diagonal characterization. Once this diagonal 
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(direct) structure is determined, all other problems reduce to the well- 
known scalar case. 

From an engineering point of view the optimum filter structures 
may provide valuable insights to the design of signal processors in the 
multiuser communications environment. 
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APPENDIX 


Yet Another Trace Minimization Problem 
By H. S. Witsenhausen 

Let (Q, B, 1) be a measure space, and for each w e 2 let A(w) be an 
N X N diagonal matrix with real positive integrable diagonal entries. 


Let Q denote a measurable function from 2 to nonnegative definite 
hermitian matrices, subject to 


J 2@()-duto) = ¢. ‘ (36) 


One seeks the minimum, over all such Q of 


f tr(I + Q(w)A(w))7*- du(w). (37) 


This problem is easily handled when Q is diagonal. The purpose of 
this Appendix is to show that the minimization can be reduced to this | 
case. 

Let Q? denote the diagonal matrix obtained from Q by replacing all 
off-diagonal elements by zero. 
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Theorem: If Q(-) satisfies the conditions of the problem, then Q*(-) 
also does and gives (37) no higher value. 


Proof: Obviously, Q? is hermitian, nonnegative definite and gives the 
same value in (36). Now (37) can be written 


i tr(I + M(w))'-du(w), (38) 
where 
M(w) = A¥?(w)Q(w)A*?(w) (39) 
is hermitian, nonnegative definite. 
We have 
M2(w) = A¥?(w)Q%(w).A?(w). (40) 


Thus it is enough to show that (38) cannot increase when M is replaced 
by M%, which we show for each fixed w. 








Let \i, ---, An be the eigenvalues of M, and let d;, ---, d, be the 
diagonal entries of M that are the eigenvalues of M?. 
& 1 
tr(I + M) = ; 
r( ) x Try, (41) 
= 1 
tr(I + M?)7? = : 42 
r( ) 2 itd, (42) 
The right side of (41) is a convex symmetric function of (Aj, «++, An). 


Hence it is Schur convex. (That is, f(SA) = f(A) for doubly stochastic 
S.) 

As observed by Schur, from the unitary relationship of M with its 
diagonal form 








M = U diag(\y, ..., A,JU (43) 
(see Ref. 8), it follows that 
d=S)A (44) 
with doubly stochastic S(S; = | uj |”). 
Thus, 
er ery. 5) 


which was to be proved. 

These two quantities, sometimes referred to as efficiency indices, 
will be used to compare the performance of the various equalization 
methods. 
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Tone Location by Cyclotomic Filters 


By D. HERTZ,* R. P. KURSHAN,' D. MALAH,* and 
J. T. PEOPLES? 


(Manuscript received July 11, 1984) 


In this paper we present a tone location system that estimates the frequency 
of an input tone through additions and comparisons alone. The system uses 
“cyclotomic” filters (which operate without multiplications), requiring fewer 
operations than with a conventional Discrete Fourier Transform (DFT) en- 
tailing multiplications. In the system presented here, an input tone is first 
transformed to yield two quadrature tones, which are then digitized. Processing 
occurs in successive stages at successively reduced sampling rates. During each 
decimation stage, the system is configured to provide symmetric coverage of a 
subband in which the tone has been located at the previous stage. Simulations 
demonstrate that for the case studied, an input signal-to-noise ratio (SNR;) 
in excess of 10 dB yields an output signal-to-noise ratio (SNR,) that is close 
(within 0.3 dB) to the maximum attainable SNR,, where SNR, is measured 
in terms of the reduction in frequency uncertainty. Enhanced resolution is 
demonstrated at the expense of the number of computations, while holding 
the number of decimation stages constant, by using small DFTs in place of 
the cyclotomic filters. This method still utilizes fewer computations than a 
conventional DFT (with the same number of frequency cells), with approxi- 
mately the same performance in the case of low noise. Thus, this alternative 
method is useful when circumstances prohibit using a single, large DFT. 


* Technion-Israel Institute of Technology, Haifa, Israel. This paper is derived from 
the M.Sc. thesis of D. Hertz, published in 1979, written under the direction of D. Malah 
and R. P. Kurshan (while the latter was visiting the Technion in 1976-7). Part of the 
work was completed while D. Malah was visiting AT&T Bell Laboratories in 1979-81. 
tT AT&T Bell Laboratories. * AT&T Bell Laboratories; present affiliation Bell Commu- 
nications Research, Inc. 
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I. INTRODUCTION 


In many diverse applications, it is necessary to detect and locate a 
signal appearing within selected frequency bands, particularly a signal 
comprised of a single tone. This is tantamount to estimating the 
frequency of a tone that may appear randomly in any band. Such 
detection and estimation are generally accomplished in conventional 
analog systems, using a bank of filters tuned to different, narrowband 
portions of the spectrum or using a single filter that is effectively 
swept across the bands of interest. Associated with such techniques, 
however, are the usual problems of analog processors, including un- 
predictability due to inherent variability of system components. A 
discrete-time technique is described by Cappelini et al.,!? wherein a 
given frequency band is partitioned into subbands for detection pur- 
poses. The partition into subbands is achieved through a decimation 
approach, using a single, fixed, low-pass digital filter at each decima- 
tion stage. In a variety of applications, however, when it is known that 
the given input signal contains, at most, a single spectral line in the 
frequency range of interest (e.g., in M-ary FSK demodulation’ and 
Touch-Tone telephony*), this method possesses inherent disadvan- 
tages. First, since general filtering is effected at each decimation stage, 
numerous multiplications and additions must be performed during the 
filtering operations of each stage. Second, a large amount of memory 
is required to store samples from the geometrically increasing number 
of iterated signals. These disadvantages are substantially reduced with 
a Cyclotomic Tone Location System (CTLS).*® Only the CTLS of 
Ref. 5 is presented here, since it is simpler than the CTLS of Refs. 6 
through 8, and we wish to apply this same idea to a DFT Tone 
Location System (FTLS) as well, described in Section IV. The FTLS, 
as compared with conventional DFT implementations,®”° trades per- 
formance in the presence of noise for reduced implementational com- 
plexity. 

The CTLS incorporates digital filters utilizing additions alone, 
thereby eliminating the customary computational load associated with 
multiplications. Figure 1 is a block diagram of the CTLS. Referring to 
Fig. 1, the input tone, comprising, say, a single frequency located 
randomly within a band 0 to f,/4, is first transformed by the Hilbert 
network to yield a complex tone composed of two tones in quadrature 
relationship. These two tones correspond to phase-shifted versions of 
the input tone. The complex tone is initially sampled by the digitizer 
at a rate of f, and stored in the buffer. Then, the complex tone and its 
frequency-shifted versions, from the frequency-shifting unit, are proc- 
essed by the first-order recursive filter unit. The filters are derived 
from the set of cyclotomic filters and have only a single resonance in 
the frequency range up to one-half the sampling rate. The filters are 
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Fig. 1—Block diagram of the CTLS. 
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arranged so that the resonances associated with the filters and fre- 
quency-shifted versions symmetrically cover a band of frequencies 
including the input tone frequency. Location of the subband contain- 
ing the tone is achieved in the decision unit by comparing the magni- 
tudes of the various filter outputs to each other after a fixed number 
of samples have been processed. 

It is assumed from here on that the tone is known to be present. The 
problem of detecting the presence of a tone will be commented on 
later in this paper. The CTLS locates the tone using a multistage 
process. At the first stage, using the first-order recursive filter unit, 
which symmetrically covers the band (0, f,/4), the CTLS unambigu- 
ously locates the tone either in (0, f,/8) or in (f,/8, f,/4). Once the 
tone has been located within a single subband, sampling rate reduction, 
or decimation, by a factor of two is effected. 

This particular choice of the sampling rate reduction ratio and 
subband width precludes additional spectral lines from appearing 
within the subband containing the tone. The filters are now reconfig- 
ured at the reduced sampling rate to again achieve symmetrical place- 
ment of the filter resonances across the previously isolated subband 
of width f,/8 containing the tone. The process of frequency shifting 
and filtering by the array is then repeated. 

The frequency subbands utilized for resolution at the output of this 
second stage are each of width f,/16. Again, the subband containing 
the tone is isolated, and another decimation stage is effected. The 
processing continues in this manner until the desired frequency reso- 
lution has been achieved. Moreover, a minor modification to the CTLS 
enables the system to test for a tone in any one of the other three 
bands [rather than (0, f,/4)], ie., [kf./4, (k + 1)f,/4], k = 1, 2, 3, 
provided k is known to the CTLS. 

The FTLS is an extension of the CTLS in the following way. 
Basically, the input tone to the FTLS is comprised of a frequency, 
located randomly within the band 0 to f,/2, and is converted to a 
complex tone. The quadrature tones are both initially sampled at a 
rate of f,, and then N samples of the complex tone are processed by 
an M-point DFT (M = 2”, N < M is mandatory, and the N samples 
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are padded by M — N zeros). The modulus of the M-point DFTs are 
further processed. As in the CTLS, the processing is done in successive 
stages at successively reduced sampling rates. During each decimation 
stage, the system is configured to provide symmetric coverage of a 
subband in which the tone has been located in the previous stage. 

At each decimation stage the pertinent band is partitioned into 
M/2 subbands [with S stages the initial band will be partitioned into 
(M/2)*° bands]. A minor modification to the FTLS enables the system 
to test for a tone in the other band, i.e., (f,/2, f,), provided the FTLS 
knows in which of the two subbands the tone is present. For N = 2, 3, 
and M = 4 this reduces to a CTLS, except for the sampling rate, which 
is now halved. 

The organization of the paper is as follows: Section II presents basic 
relations and a detailed description of the CTLS; Section III gives 
simulation results associated with the CTLS; Section IV presents a 
detailed description of the FTLS; and Section V presents the conclu- 
sions. 


Il. CTLS OPERATION PRINCIPLES 


For clarity of exposition, we first present an overview of the prop- 
erties of the first-order recursive filters utilized here. Next, we discuss 
in detail the time-domain and frequency-domain characteristics of one 
first-order filter (designated C,), treated as exemplary of the remaining 
filters of interest (designated C2, C,,, and C4), to illustrate funda- 
mental concepts helpful to fully comprehend the overall CTLS. Fi- 
nally, we describe the technique for exploiting the properties of the 
individual filters to form a composite tone detection system. 


2.1 Cyclotomic filters 


The properties of the cyclotomic filters discussed herein are pre- 
sented in greater detail in Kurshan and Gopinath" and Hertz.® Cyclo- 
tomic filters are a class of digital filters having only poles in the 
transfer function and, moreover, each pole lies on the unit circle. This 
means that the filters are inherently unstable and are not suitable for 
conventional filtering operations, which require the processing of 
numerous sequential samples. In fact, these filters behave more like 
resonators and it is this property that can be beneficially utilized for 
tone detection. The salient advantage of this type of resonating filter 
is that the filters of primary interest exhibit nonzero coefficient values 
(+1, +7) having a modulus of one. This implies that multiplications 
of samples by filter coefficients reduce to simple addition, subtraction, 
and signal interchange operations and, significantly, arithmetic errors 
such as round-off and truncation errors are eliminated. 

A cyclotomic filter may be described in terms of a linear recursion 
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yin +1) 





Fig. 2—Cyclotomic digital filter C, in block diagram form. 


relation whose characteristic polynomial is a cyclotomic polynomial. 
For example, for a first-order digital filter represented by the Linear 
Difference Equation (LDE) 


y(n + 1) = y(n) + x(n), (1) 


where x;(n) and y;(n) are inphase input and output sequence ele- 
ments, respectively, corresponding to the nth sample, then the char- 
acteristic equation is given by the polynomial \ — 1 (as derived from 
yin + 1) — y;(n) = 0). This polynomial is cyclotomic and has the 
designation C,(\) (or C, for simplicity): Ci(A) = \ — 1. Similarly, the 
cyclotomic polynomial C2(A) = \ + 1 yields a digital filter described 
by the LDE 


y(n + 1) = y(n) + x(n). (2) 


Other filters of special interest include two first-order, complex fil- 
ters derived as the roots of the second-order cyclotomic polynomial 
C,(\) = \? + 1. These filters also have roots on the unit circle and are 
given by (with j = V—1) Cup(A) = A —j and Cyn(A) = \ + j; they have 
the following LDE representations, respectively: 


y(n + 1) = jy(n) + x(n) (3) 
and 
y(n + 1) = -jy(n) + x7(n). (4) 


Figure 2 depicts the C, digital filter in block diagram form. Similar 
block diagrams can be derived for the other first-order filters. 


2.2 Filter characteristics 


To elucidate the desired characteristics obtained by combining 
cyclotomic-derived filters into a system architecture, we first present 
the response of a general filter to an input tone having a randomly 
distributed phase component. The input tone is presumed to have the 
analog form Acos(2z/ft + ¥), where ¥ is the random phase variable, A 
is the amplitude, and f is the tone frequency. 

The impulse response sequence of the general recursive filter is 
represented by {h(k)}, where k = 0. The output sequence elements 
may then be obtained from the convolutional relationship 
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1S ahaSP: (5) 


k=0 


where {x;(k)} is the input tone sequence obtained by sampling 
Acos(2z ft + £), t = 0, at the rate f,, that is, 


x1(k) = Acos(kv + ¥), k=0, (6) 
where 
vp = 2rf/f.. (7) 
Substituting (6) into (5) gives 
y(n) = Vaz + B2cos an (2) + el, (8) 
where 
a, = Re 5 Ae?”’h(n — inf (9) 
k=0 
Bn = Imag [i Ae*”’h(n — i] ; (10) 
k=0 


and the operators denoted “Re” and “Imag” produce the real and 
imaginary parts of the bracketed part of eqs. (9) and (10), respectively. 

Since ¥ occurs randomly within the interval (0, 27), comparison of 
| y7(n) | to a threshold may yield deleterious results due to the depend- 
ence of y, on ?. However, by utilizing the first-order filters in pairs 
(either actually or on a time-shared basis), the undesirable modulation 
effects of the random phase component may be eliminated. 

This is achieved by forming a new sample sequence {xg(k)}, found 
by sampling the quadrature tone Asin(2z ft + #), which may be derived 
through a Hilbert transform operation on the original or inphase input 
tone. The new sequence elements are given by 


x(k) = Asin(kv + ¥), k2=0. (11) 
If {xg(k)} is processed by a recursive filter identical to the one that 
processes {x;(k)}, then the new output sequence { yg(n)} has elements 


yo(n) = Va? + B2sin tan (2) + el. (12) 


An 
A squaring operation on both eqs. (8) and (12), followed by a sum- 
mation and square-root operation, yields an output Va? + 62, which 
is independent of #. 
For future discussion, it is convenient, as exemplified by the form 
of eqs. (9) and (10), to define a complex input tone having sample 
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values Ae/‘’*®), and a corresponding output sequence {z,} having 
complex element values 


n 


Zn = Y, Ael*’*% h(n — k). (13) 


k=0 


The magnitude or modulus of each element of {z,} is then given by 
lzn] =| % Ae**h(n — k)| = Von + Br (14) 
k=0 


As hereinafter utilized, the two-filter device characterized by substan- 
tially identical filters C;(i = 1, 2, 4p or 4m), which processes inphase 
and quadrature samples of an input signal in pairs, is a basic or 
fundamental element of the CTLS. 

Whereas the above discussion has focused primarily on sequence 
domain manipulations, it is helpful to visualize these manipulations 
in the frequency domain. Moreover, whereas the discussion was 
couched in terms of generalized impulse responses, particularly perti- 
nent to the subsequent discussion are the frequency domain responses 
of filters C,, Co, C4p, and Cym. The filter C;, having impulse elements 
h(n) = 1, n = 0, is treated as exemplary. 

Utilizing now the notation z(C,, n, v) to distinguish sequence ele- 
ments of {z,} so as to explicitly set forth the dependence upon C,, n 
and », we obtain the following by substituting h(n) = 1 into eq. (14): 


| 2(Ci, n, v) | =A | > e/*”| 
k=0 





or 
. [n+l 
in(*2), 
[2(Ci, 2») | =A.) SS ———"_ (15) 
sin 9 
In addition, because the impulse response of C, is (—1)", n = 0, 
|2z(C., n, v)| = |2(Cy, n, vy + x) I. (16) 


Figure 3 shows plots of |z(C,, n, v)| and | z(C2, n, v)| over the range 
(0 to z) for n = 3, that is, four samples corresponding to n = 0, 1, 2, 
and 3 have been processed. The resonating feature of the filters is 
apparent. Filter C, is symmetric with respect to vy = 0, whereas C; is 
symmetric about vy = z, and both are periodic with period 27. Since 
v= 2f/f,, v = x corresponds to a frequency f, which is one-half the 
sampling rate. Although the filter response has been illustrated with 
four samples, the filter may also be operated with two, three, five, six, 
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Fig. 3—Responses of filter pairs C, and C,, each processing real and imaginary 
sampled tones in parallel, over the range from 0 to one-half of the sampling rate with 
N = 4 samples. 


or seven samples. If more than seven samples are used, the main lobes 
of adjacent filter responses do not intersect. In fact, simulations reveal 
some enhancement in the signal-to-noise performance when more 
than four (but fewer than seven) samples are processed. 

In a similar manner, the following relations may also be derived: 


wn|(*)(»-5)| 





AC 2) | =A. (17) 
=. 
a 
sin a aa 
and 
BAAN ge 8 
sm 2 v 9 
[20C ag | SA ln (18) 
(> + =) 
sin aa ae 


Because of the manner in which C,, and C,,, are related to Ci, 


vis 
Zz (Cer nv t+ Z| Zz (Cin nvr Z| 
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= |z(Ci, n, v) ls (19) 














Figure 4 shows plots of |2(C4p, n, v)| and |z(Cam, n, v — 1)| for the 
same parameters as Fig. 3. Since cyclotomic filters have poles on the 
unit circle, their responses blow up. However, they can be used only 
because the input signal is time limited, hence the composite response 
due to the sinusoidal input can be examined and leads to eqs. (15) 
through (18). 

From the plots of Figs. 3 and 4, we conclude that the filter response 
from each two-filter device comprising identical filters C;(i = 1, 2, 4p 
or 4m) has only a single resonance in the frequency range up to one- 
half of the sampling frequency. The CTLS exploits these filter pairs 
by covering the frequency band from 0 to f,/2 (v from 0 to 7) in 
symmetric fashion. Such an arrangement is depicted in Fig. 5a. Since 
C4, and C4, are merely frequency-shifted versions of C; and it appears 
that C,, and C,,, require complex manipulations, it is important to 
determine if a modified, real sequence can serve as an input to a C; 
filter to produce an output equivalent to a C,, or C4, filter output. 

Phase shifting by +7/2 in the frequency domain is equivalent to 
introducing modulation factors in the sequence (sampled time) domain 
of the form {e*/**/*}, Thus, if the complex input sequence is modified 
by the modulation sequence, a new sequence having element values 


Ae lRvtx/2)+9] (20) 


gives rise to an output corresponding to C,, or Cy, aS appropriate. 
The inphase and quadrature sequences associated with this complex 
input sequence become, respectively, 


MAGNITUDE 
5A 


4A 


C4p (9) 


|z(C, n,6)| Cam\O-77) 





-7 -7/2 0 6 7 
72 -f6/4 f f,/2 


Fig. 4—Responses of complex, recursive filter pairs C,, and C4, in the same scale 
and over the same range of frequency as in Fig. 3. 
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Fig. 5—(a) Arrangement of three first-order recursive filters that cover the frequency 
band from 0 to one-half of the sampling rate. (b) Arrangement of four recursive filters 
that cover the band from 0 to one-half of the initial sampling rate and are symmetric 
over the four subbands shown in Fig. 5a. (c) Arrangement of eight recursive filters that 
cover the band from 0 to one-half of the initial sampling rate and are symmetric over 
the eight subbands shown in Fig. 5b. 


Acos(kv + ¥)cos(kr/2) + Asin(ky + ¢)sin(k7/2) (21) 
and 
Asin(kv + #)cos(kx/2) + Acos(kv + ¥)sin(k7/2). (22) 


Since k is a nonnegative integer, the shifted inphase sequence reduces 
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to a sequence that is periodic of period four having element values for 
k =0, 1, 2 and 3 of [from eq. (21)]: 


Acos(kvy + ); + Asin(ky + ¥); 
— Acos(kvy + ¥); + Asin(kvy + ¥). (23) 


Similarly, the shifted quadrature sequence has elements for k = 0, 1, 
2, 3 [from eq. (22)] of: 


Asin(kv + ¥); + Acos(kv + §); 
— Asin(ky + ); + Acos(kvy +). (24) 


Examination of eqs. (23) and (24) suggests that the operation of 
normalized frequency shifting by +7/2, rather than occurring through 
frequency-domain manipulations, may be straightforwardly imple- 
mented in the sequence or sample domain merely by interchanging 
and changing signs of the inphase and quadrature inputs to a filter 
pair, when appropriate. Because of the form of:eq. (20), such an 
implementation may be referred to as a modulus-one multiplier (j**) 
or frequency shifter. 

Figure 6 is a block diagram of the modulus-one multiplier for the 
case of a —1/2 frequency shift (C,,). The inputs to this unit are the 
original inphase and quadrature sequence elements. The operations of 
sign changing and line interchanging occur within this unit, as depicted 
fork=0,1,...,4,.... The frequency-shifted, inphase, and quadrature 
sequence elements, respectively, are fed to two identical C, filters 
whose outputs are squared, summed, and square rooted. Finally, this 
unit provides the response described above by C4; Ap: Note that the 
operations needed to compute the modulus Vx? + y? can be simplified 
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Fig. 6—Block diagram of a modulus-one multiplier for the case of a —7/2 frequency 
shift (C4,). 
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by using known approximations to the modulus (see Refs. 12 through 
14). An example is the one given by 


Vx? + y? = Max(|x], |yl) + & Min(|x], |y|). (25) 


a is a scalar multiplier. In Refs. 5 through 7 we used a = 0.25, so that 
multiplication by a corresponds to two shift operations. The choice of 
a = 0.25 causes only a small degradation in performance, as has been 
noted in simulations. 

Figures 5b and 5c depict how each of the two subbands in the range 
0 to z/2 of Fig. 5a may be further partitioned to isolate the tone of 
interest. Basically, the subbands are subdivided into second-stage 
subbands by reducing the sampling rate and reconfiguring the filter 
array within each subband of interest. For instance, the first-stage 
subband labeled C; in Fig. 5a has been subdivided by reducing the 
sampling rate by a factor of two and covering the old subband by C;, 
C4, which effects two second-stage subbands symmetrically dispersed 
across the original subband. Further sampling rate reduction bya 
factor of two results in the partition of Fig. 5c. The configuration and 
covering of Fig. 5b occurs within each subband during each stage of 
decimation after the first stage. 

To understand how the given input tone can be isolated by process- 
ing with consecutive stages of decimation, the steps in processing a 
single tone of frequency fp are now considered. For the complex tone 
initially sampled at a rate f,, spectral lines occur in the digital spectrum 
at fo + kf,, k = 0, +1, .... The sampling rate is chosen so that the 
tone falls within the range 0 to f,/4 (0 < fo < f,/4); thus the tone may 
be unambiguously determined to fall within one of the subbands or 
cells (0, f,/8) or (f/8, f;/4), by processing the outputs from the filter 
array or bank configured as C,, C,,. This is basically accomplished by 
comparing each filter output to another. 

If the complex tone samples are now decimated by a factor of two, 
that is, only every second value from the original set of samples is 
retained, then spectral lines appear at fo + kf,/2, k = 0, +1,.... By 
selecting the 2:1 ratio between initial sampling rate and reduced 
sampling rate, and by selecting cell widths of f,/8 for the first stage of 
sampling, no additional spectral components fall within the original 
subband containing baseband spectral line fp. This is true, even though 
aliasing occurs, because the judicious selection of cell width and 
reduced sampling rate precludes aliased spectral lines from appearing 
in the subband containing the spectral line of interest. 

The process described with respect to the decimation by a factor of 
two may continue until the desired resolution (final cell width) is 
achieved. In Figs. 5a through c three stages of processing are exempli- 
fied. The tone will thus be resolved to a cell of width f,/32. 
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For example, say that 2(f,/32) s fo S 3(f,/32) (see also the arrows 
in Figs. 5a through c). At the first stage (Fig. 5a), the output of C, is 
compared to the output of C,,; since “|C,| > | C4,|” it is concluded 
that the tone is located in 0 S fp s f,/8. At the second stage (Fig. 5b), 
the output of C, is compared to the output of C,,, and since “|C,| < 
| C,,|” it is concluded that the tone is in f,/4 = fo s f,/8. At the final 
stage, stage three (Fig. 5c), the output of C,, is compared to the output 
of Cy, and since “|C,,| > |C2|” it is concluded that the tone is in 
f;/16 < fo S 3f,/32, which is the correct presumed tone’s location. 

From the description with respect to the band from 0 to f,/4, it is 
also apparent that a tone in the bands (kf,/4, (k + 1)f,/4), k= 1, 2, 3 
could be processed in an analogous manner, provided k is known to 
the CTLS and only a single tone is present. 


Il. CTLS PERFORMANCE 


In the previous section only the principles of operation of the CTLS 
were presented, and no consideration was given to the presence of 
noise in which the tone is usually embedded. 

To combat the noise we propose two methods, based on soft decision 
and hard decision, respectively. To examine the performance of the 
CTLS, computer simulations were performed. In the simulations the 
location of the tone was initially set in the interval (0, f,/4) and was 
resolved by the CTLS into one of 64 cells. In each experiment 256 
complex data words were processed. At each stage, the filters were 
operated for a small number of samples (2 = N < 7) as compared with 
the number of data words (256). Therefore, the filter operation was 
repeated at each stage (without overlapping the data), exhausting the 
data. 
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Fig. 7—Block diagram of the configuration for generating the input noisy tone to the 
TLS. 
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Fig. 8—Probability of correct tone location vs. number of samples; the filters are 
operated at different input SNR; values. 


With the hard-decision method, the CTLS uses (at each stage) the 
maximal output of the filters (over all repetitions). With the soft- 
decision method, the average (over all repetitions) of the filter’s output 
at each stage is used. 

Figure 7 is a block diagram of the operations carried out to generate 
the noisy complex digital tone that was fed to the CTLS. {n,} is a zero 
mean white (E{n,n;} = 076,;) Gaussian sequence, which was low-pass 
filtered to eliminate out-of-band noise. 

The simulations were carried out for different values of SNR;(6, 8, 
10, 12 dB): 


A? 
E{ni(k) + na(k)}’ 


where E denotes the expectation operator. Figure 8(a through d) 
presents simulation results for the probability of locating the input 
tone in the correct cell as a function of the number of samples for 
which the filters are operated (2 < N <7). 

Another performance measure used in our simulations is the output 
signal-to-noise ratio (SNR,), which gives the reduction in frequency 
uncertainty,’ that is, 


SNR; 4 (26) 
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SNiS (27) 


In (27) v, denotes the uniformly distributed input-normalized fre- 
quency (in the range 0 to 7/2), and », is its estimate at cell width. 

It is known’? that when no errors are made in assigning the input 
signal to a cell, then 


max SNR, = (N, — 1)”, (28) 


which gives 36.12 dB for N, = 64, that is, 64 cells. Figure 9(a through 
d) shows the computer simulation results obtained for SNR, as a 
function of the number of samples at which the filters were operated 
(2< N77), at different SNR; values (in the range of 6 to 12 dB). 
From the simulation results in Figs. 8 and 9, we conclude the 
following: 
1. The soft-decision method is superior to the hard-decision method. 
2. The filters should operate with N = 6 samples. 
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Fig. 9—Output signal to noise ratio (SNR,) vs. number of samples; the filters are 
operated at different input SNR; values. 
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3. For SNR; = 10 dB and in conjunction with (1) and (2), the 
probability of locating the tone in the correct cell is in excess of 95 
percent, and SNR, is smaller than max SNR, by less than 0.3 dB. 


‘IV. DFT TONE LOCATION SYSTEM 


In this section we present the DFT Tone Location System (FTLS). 
Let {x,;}%o! be N samples of the complex tone; then 


N-1 jot pj 
X, = DFT{x;} = ») xe ™, (29) 
Le., x; =O fori=N,N+1,...,M—1(M>N is mandatory), and 
Ay = |Xzl, k=0,1,...,M-1. (30) 
It can be easily demonstrated that 
A; = | yn-1() |, (31) 
where 


2m, 
yuk) =e ™ y(k) + x, 
y(k) = 0, l=0,1,...,N—1. (82) 


Therefore, z(C;, N—1, v),i=1, 2, 4p, 4m [see eqs. (15) through (18)] 
are merely particular samples of Az, i.e., Ao, Amz, Amys, Asus. Similar 
to the derivations of (15) through (18), we get 


[el -#9) 


k= a ie ny Cae) ’ 
eRe 2 
sin 5 (» ~ 22a) 


where A and »y are the tone’s amplitude and normalized frequency, 
respectively. 

The operation principles of the FTLS are now explained. The input 
tone, which is comprised of a frequency located randomly within the 
band 0 to f,/2, is converted into a complex tone. The quadrature tones 
are both initially sampled at a rate of f, and then N samples of the 
complex tone are processed by an M-point DFT (M = 2", N<M, and 
the N samples are padded by M — N zeros). The modulus of the M- 
point DFTs are further processed so that the resonances associated 
with the DFT cover a band of frequencies, including the tone fre- 
quency, in symmetric fashion. Location of the subband containing 
the tone is achieved by finding the maximal modulus of the first 
(M/2) + 1 even (counting from zero) DFT points, Ao, say, and then, 


(33) 
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if 2k ~ 0, M/2, comparing A2,-; and Aoz4;. The technique is exempli- 
fied in Figs. 10a and b for M = 8 and N = 4. 

In the FTLS the above procedure (at stage 1) partitions the initial 
band (0, f,/2) to M/2 subbands, each of width (f,/2)/(M/2), and the 
tone lies unambiguously within one of the M/2 subbands because of 
oversampling. Once a tone has been located within the confines of a 
subband, sampling rate reduction or decimation by a factor of M/2 is 
effected. Although the tone is now undersampled, judicious choice of 
sampling rate reduction ratio and subband width precludes additional 
spectral lines from appearing within the subband containing the tone. 
At the reduced sampling rate again (as before), N samples of the 
complex tone are processed by an M-point DFT. The moduli of the 
M-point DFT are further processed to again achieve a symmetrical 
placement of the DFT’s resonances across the subband of width 
(f;/2)/(M/2) containing the tone. Now, if in the previous stage the 
tone has been located in an even subband (starting with the zeroth 


cell), find the maximal value of {Apo, As, ... , Ame}; otherwise, find the 
maximal value of {Ayj2, Amyo+2, -.. , Am-2, Ao}, say Aor, and compare 
Ao A Az A3 Ag 





0 7/4 7/2 6 T 
f,/8 f,/4 f f,/2 
(a) 


LN 
NAA 


0 W/2 rg 37/2 27 57/2 6 71/2 40 
f,/8 f,/4 f f,/2 


(b) 
Fig. 10—Operation of the FTLS for M-8, N-4: (a) first stage, and (b) second stage. 


Sv, 
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Aor-1 With Azz+1 (provided that k ~ 0, M/2, for both cases). Now (after 
stage 2) the tone is located in a subband of width (/,/2)/(M/2)*. This 
process may be repeated until, finally, at the last stage, say S, the 
frequency subbands utilized for resolving the tone are each of width 
(f;/2)/(M/2)*. Note that at each stage only the even DFT coefficients 
and two odd DFT coefficients are needed. Moreover, a minor modifi- 
cation to the FTLS enables the system to test for a tone in the other 
band, i.e., (f,/2, f;), provided the FTLS knows in which of the two 
subbands the single tone is present. Also, for N = 2, 3 and M = 4 we 
have a FTLS that resembles the CTLS, except for the sampling rate, 
which is now halved. 

It should be noted that the relation M > N is mandatory, and 
ensures that comparisons between the pertinent A/;’s will be within 
their main lobes. Note that the ideas to combat noise presented in 
Section III for the CTLS pertain as well to the FTLS. 

Now the FTLS is exemplified by choosing M = 8 and N = 4. 
Referring to Fig. 10, A, through A7 are obtained from four samples of 
the complex tone [see (30)]. Dividing the band 0 to =z into four 
subbands is carried out as follows: 

1. The maximal value of {Ao, Az, A4} is computed. From the re- 
sult we decide in which of the three subbands (0, 7/4), [/4, (37)/4], 
[(37)/4, a] the tone is present. If the tone is not in the subband 
[1/4, (37) /4], we continue to the next stage (Fig. 10b); otherwise we 

2. Compare A, and A3, and accordingly find in which of the 
two subbands, (7/4, 7/2) or [/2, (37)/4], comprising the subband 
[7/4, (37)/4] the tone is present. Now Fig. 10b depicts how each of 
the four subbands in the range 0 to z of Fig. 10a may be further parti- 
tioned to isolate the tone of interest. Basically the subbands are sub- 
divided into second-stage subbands by reducing the sampling rate by 
a factor of M/2 = 4. Now if the tone is present in an even (counting 
from 0) subband, (0, 7/4) or [2/2, (37)/4], find the maximal value 
of {Ao, Ag, Ay}. Otherwise (i.e., the tone is present in one of the 
odd subbands (7/4, 7/2) or [(37)/4, z]), find the maximal value of 
{A4, Ag, Ao}. If the even or odd subband output amplitude Apo or A, is 
maximal, then continue to stage 3. If the maximal value is A> (even 
case) or Ag (odd case), then compare A; with A3 or As with Aj, 
respectively. Now the tone has been located within a subband of width 
(f,/2)/(8/2)? = f,/32, and the processing continues in this manner 
until the desired frequency resolution is achieved. 


V. CONCLUSION 


We have presented two methods of tone location (CTLS and FTLS) 
as an alternative to conventional DFT. For a given requirement of Nc 
frequency resolution cells, a conventional DFT, which is a maximum- 
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likelihood estimator, uses a single transform of size Nc requiring 
(Nc/2)logs( Nc) multiplications, which are the main computational 
burden. The associated complexity of such multiplication is eliminated 
in the CTLS, using a decimation scheme involving filters that have 
coefficients +1, +/j, i.e., multiplier-less digital filters. For the same 
number of resolution cells, computational complexity is significantly 
reduced, at the expense of increased frequency uncertainty as a func- 
tion of increasing noise. This uncertainty is decreased in the FTLS, 
where the cyclotomic filters of the CTLS are replaced with small DFTs 
(which can be repeated several times at each decimation stage). While 
this improved performance is at the expense of increased computa- 
tional complexity (compared to the CTLS), and the resulting system 
does not have the optimal performance of a single DFT, the FTLS 
can nonetheless be preferable over a single DFT when the size of the 
latter makes it impractical to implement. To obtain Nc frequency 
resolution cells by using the FTLS in S stages, one M-point DFT is 
sufficient at each stage, and we have the relationship Nc = (M/2)°. 
Hence, using the FTLS, M = 2 VNo, and S(M/2)logo(M) multiplica- 
tions are sufficient for locating the tone. For example, suppose 
Nc = 4096 and S = 4; then M = 16, and the conventional DFT requires 
24,576 multiplications, whereas the FTLS requires only 128 multipli- 
cations. 

It follows from the simulations of the CTLS that the soft-decision 
method should be preferred and the filters should be operated for 
N = 6 samples. For input signal-to-noise ratios in excess of 10 dB the 
output signal-to-noise ratios differ from the maximal output signal- 
to-noise ratio by less than 0.3 dB. Finally, note that tone presence can 
be determined by comparing the filter outputs in the first stage to an 
appropriate threshold value (see Refs. 11 and 15 and the references 
therein). 
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In this work we have considered direct-sequence spread-spectrum transmis- 
sion for indoor wireless communications. We have modeled the indoor com- 
munications medium, which is a multipath fading channel, by a discrete set 
of Rayleigh faded paths. We have proposed new analytical techniques to 
evaluate the probability of error for the receiver terminals studied in this 
work. Numerical results reveal that, for the nondiversity receivers considered 
here, Rayleigh fading is very hostile to this form of modulation/multiple- 
access technique. The results also indicate that either some form of operation 
to prevent Rayleigh fading or diversity operation to counteract Rayleigh fading 
is required. 


I. INTRODUCTION 


A principal purpose of this paper is to evaluate the performance of 
a direct-sequence Spread-Spectrum Multiple-Access (SSMA) system 
using Binary Phase Shift Keying (BPSK) modulation for Indoor 
Wireless Communications (IWC). 

In the past decade there has been increased interest in a class of 
multiple-access techniques known as Code Division Multiple Access 
(CDMA) in which the mode of access is due primarily to coding by 
spread-spectrum sequences. In contrast with traditional time- and 
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frequency-division multiple access, these techniques do not require 
accurate time or frequency coordination among the transmitters in 
the system. This feature makes CDMA very attractive for IWC appli- 
cations, because, as described later, an IWC takes place in a severe 
multipath fading environment. However, the cost of the ease of access 
is the increased channel bandwidth required by spread-spectrum codes. 
The bandwidth spreading leads to a low spectral energy density level, 
which is an advantage in military communications in hostile environ- 
ments. It also offers the possibility of reusing overcrowded radio 
frequency bands, as recently studied by the Federal Communications 
Commission.’ 

SSMA is a common form of CDMA in which every user is assigned 
a code sequence modulated onto a carrier signal according to the user’s 
digital information. A high-rate code, that is, many code chips per 
data bit, spreads the bandwidth of the information signal. Frequency- 
hopped?“ and phase-modulated SSMA° are two forms of SSMA. The 
latter, also known as direct-sequence spread-spectrum multiple access, 
is what we concern ourselves with in this work, for its multiple-access 
capability and ease of implementation. 

Since in direct-sequence SSMA the entire channel bandwidth is 
available to all users of the system at all times, the code sequence 
applied by each user in spreading the information band must have low 
cross-correlation properties in order to achieve a low level of mutual 
interference among the users. Several classes of code sequences suit- 
able for this application have been presented by Sarwate and Pursely.® 
A class of these codes that are employed in our work is the well-known 
Gold sequences, which possess the low cross-correlation properties 
required in multiple-access applications. 

The chief purpose of this paper is to assess the communication 
performance of a direct-sequence SSMA system in the presence of 
multipath fading that is a characteristic of IWC. Our criterion of merit 
is average probability of error as a function of signal-to-noise ratio. 

There is a sizable literature relating to the effects of multiple-access 
interference on the performance of a direct-sequence SSMA, among 
which are Refs. 7 through 12. All of these consider the communication 
channel to be a single path with no fading. In IWC applications, 
because of the existence of many reflectors and scatterers, multipath 
fading is severe. Preliminary impulse response measurements inside 
office buildings indicate the severity of multipath fades.’*"* Hence, the 
attempt in this work is to incorporate multipath fading effects in the 
analysis of average probability of error of direct-sequence SSMA. 
Among the limited number of articles relating to the effects of multi- 
path fading on the performance of direct-sequence SSMA is the work 
by G. Turin’ that examines the structure of a number of receivers for 
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mobile digital radio. However, the computer simulation results in Ref. 
15 are restricted to the behavior of a single transmitter-receiver pair, 
and therefore, no multiple-access interference is taken into account. 
References 16 through 18 consider fading links, although Ref. 17 
specifies single-tone, rather than multiple-access, interference. How- 
ever, almost all studies have used measures other than average error 
rate in their evaluation. Reference 18 presents a simplified analysis 
by invoking the Gaussian assumption for the composite multiple- 
access interference distribution previously addressed in papers by Yao” 
and Mazo.’° We avoid any argument about the validity and range of 
accuracy of the Gaussian assumption. In this work, unlike in Ref. 18, 
we make no assumption about the distribution of the composite 
multiple-access interference. By employing moment-generating tech- 
niques, we evaluate the average probability of error in the presence of 
multipath fading. In this evaluation we apply the numerical Gauss 
quadrature integration.’ Specifically, our work extends the work in 
Ref. 12 to channels with multipath fading. 

In Section II we begin with a description of the SSMA system and 
the channel model. We then derive the conditional error probability. 
Evaluation of average error probability by the moment-generating 
approach is described in Section IJI. Numerical results are discussed 
in Section IV. Finally, a summary and conclusions are presented in 
the last section of the paper. 


Il. THE MODEL AND THEORETICAL DEVELOPMENTS 
2.1 System model 


Consider the direct-sequence SSMA transmission system model for 
K users depicted in Fig. 1. The kth user’s information signal b;(t) is a 
sequence of rectangular pulses taking on values from the set {+1} over 
a T-seconds time interval. Hence, 


b(t) = Y bf Pr(t — jT), (1) 
je—o 
where bi represents the kth user data at the jth timing interval and 
Pr(-) is a rectangular waveform of T-seconds duration. The kth user 
is assigned a code waveform a,(t) that consists of a periodic sequence 
of rectangular chips taking on values from the set {+1} each of duration 
T. seconds. If a? represents the ith-chip value of the kth user, then, 


a(t) = ¥ afPp(t - iT.). (2) 


i=—o 


We assume each user code sequence has a period of N = T/T,. That 
is, there is one period of code sequence per data bit. 
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After spreading the information bandwidth to N times its original 
value, by modulo-2 adding the direct-sequence code to the data sig- 
nal and biphase modulating the result onto the carrier signal, 
Acos(w,t + 6.)— where A is the carrier level, w, is the nominal carrier 
frequency, and 6; is the carrier phase that is assumed to be uniformly 
distributed between 0 and 27—the transmitted signal of the kth user 
becomes 


S;(t) = Aa, (t)b;,(t)cos(wet + Oz), k= 1, 2 18% yy K. (3) 


2.2 Transmission channel model 


In spread-spectrum transmission over multipath fading channels, if 
the spread bandwidth of the transmitted signal exceeds the coherence 
bandwidth of the channel, where the latter is proportional to the 
inverse of the channel maximum multipath delay spread, the multipath 
components can be resolved into a discrete number of Rayleigh faded 
paths. The number of resolved paths depends on the channel multipath 
spread and the spreading bandwidth of the signal, as discussed by 
Proakis.”° We assume that the IWC channel for the desired transmitter 
and receiver (k = 1) depicted in Fig. 1 can be represented by an L- 
paths Rayleigh fading model where a single transmitted pulse is 
received via L-paths at the random time instants t;,/=1,---,L. We 
assume ¢t; is uniformly distributed over one bit period, i.e., over (0, 7). 
This is ensured by signaling at baseband at a rate less than the channel 
coherence bandwidth. Hence, intersymbol interference is negligible 


Acos(Wet + 6) 


a4 (t) n{t) ay (t) 






cos(Wt + P;) 
\ 


THRESHOLD 
DETECTOR 


Acos(wet + Ax) 


Fig. 1—System model. 


ax (t) 
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here. Therefore, the low-pass equivalent impulse response of the 
passband channel, h(t), can be represented by 
L 


h(t) = Y Bd(t — tie, (4) 
i=l 

where 6(-) is the Kronecker delta, 6, is a Rayleigh distributed random 
path gain, and ¢; is the random path phase, uniformly distributed 
between 0 and 27. It is further assumed that all the parameters of all 
paths are identically distributed over their specified range. These 
assumptions are also related to G. Turin’s’ description of a discrete 
multipath fading environment. As stated earlier, the L-paths model 
stems from the fact that spread-spectrum signaling with a transmitted 
signal bandwidth much wider than the coherence bandwidth of the 
multipath fading channel enables the multipath components to be 
resolved. Therefore, the channel seems to be fading-frequency selective 
to the signal. In eq. (4) all the variables are time invariant. 

To keep the analysis tractable we further assume that the kth 
interfering user of the multiple-access system is linked to the receiver 
of Fig. 1 via a single Rayleigh fading path with a uniformly distributed 
random delay 7; ranging from zero to one bit period, T. The conclusions 
will reveal that there is no loss in generality in making such an 
assumption. Since our chief aim is to show what is not possible, more 
elaborate models incorporating more noise sources could only 
strengthen our conclusions. 

Since 7, and t; are independent, the model results in a Rayleigh 
faded interfering user sequence being received asynchronously com- 
pared with the desired user sequence at the receiver in Fig. 1. In our 
formulation we specify the Rayleigh distributed path gain of the 
interfering users by V;, k = 2, --- , K. Therefore, as depicted in Fig. 
1, the received signal for the fading model described above is given by 

i 
r(t)=A ¥ Bia, (t — ti)bi(t — t)cos(wet — wety + $1 + 41) 
jet 


K 
+A > V,a,(t => Tr)bp(t = Tr)COS(Wet — WeTR + 6.) + n(t), (5) 
k=2 
where n(t) is white Gaussian noise with double-sided spectral density 
of No/2 level and 6, can be assumed to be zero with no loss of generality. 
The desired receiver is assumed to coherently recover the carrier 
phase and delay lock to the first arriving desired signal. Therefore, 
after (1) the correlation operation that collapses the wideband coded 
signal into a narrowband modulated signal and (2) the demodulation 
process, a signal sample at the receiver low-pass filter output can be 
expressed as 
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T 
f= if r(t)a;(t)cos(w.t)dt; (6) 


or, using eq. (5) we have 


AL ss 
g= 2% Bi f a(t — tb, (t — t))ai(t)cos(y;)dt 
Ak - 
+ rr Py Vi i a, (t — T)b,(t — Tx)ai(t)cos(®,)dt + n, (7) 


where 7 is a sample of the Gaussian noise with zero mean and 
(NoT)/4 variance, 7 = i = Wel; and O, = 6, —~— WeTk. 
2.3 Detection problem and average error probability 


The assumption of phase and delay locking of the receiver to the 
first desired modulated signal that is received enables one to rewrite 
eq. (7) as 


T 
= 615 f a2(t)b,(t)dt 


Az 
+ ry x Bi f ay(t — t))b\(t — tay (t)cos(y,)dt 
AK . 
+ 2 » V, f a,(t _ Tr)bp(t _ Tp)a,(t)cos(O,)dt + 7. (8) 
We notice that from eq. (1), 
b(t) = Yb} Pr(t ~ JT), (9) 


and therefore, eq. (8) can be expressed as 


L 
=p; > bo + : 2 BilbtyRia(t) + b3R1, (ti) ]cos(y) 


K 
+5 Y. VilbtsRia(rs) + O6Rxs(ra)le0s(Os) +m, (10) 
where bg represents the information bit being detected and b+, is the 
preceding bit, which, due to the channel delay spread, affects the 
detection of bj received on the first path between the desired trans- 
mitter and receiver. 
In eq. (10), 


Rit) = f ay(t — ti)a,(t)dt (11) 
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and 


T 
Rit) = ip a(t — t)a,(t)dt 
t 

are partial autocorrelation functions of the regenerated desired code 
at the receiver, that is, a,(t) with its delayed version received via the 
Ith Rayleigh faded path, namely, a,(t — t,). We note that the coded 
sequence received via the first path between the transmitter and 
receiver of Fig. 1 is fully correlated with the regenerated sequence 
a,(t), owing to the delay locking operation introduced at the receiver. 
Also, due to the asynchronous arrival of the interfering user’s code, 
eq. (10) contains partial cross correlations of the regenerated sequence, 
a,(t), and a delayed version of the interfering codes defined by 


Reilte) = f a,(t — T,)a;(t)dt (12) 


and 


T 
Realtek) = f a, (t — Tr)ay(t)dt. 
TR 

In the second term of eq. (10), if the polarity of the two adjacent data 
bits b1; and bj happens to be the same, the sum of the two partial 
autocorrelations turns into a full autocorrelation with the same polar- 
ity as bj. This could have been useful if the path phase were known. 
However, due to the random path phase, the second term in eq. (10) 
only adds to the channel noise. In general, signals delayed by amounts 
outside +7, seconds about a correlation peak in the correlation of 
a,(t) are attenuated by an amount determined by the processing gain 
of the code, that is, N = T/T,. To assess the degree of degradation 
that is due to modulated partial correlation, in a separate case, we 
postulate having off periods of a T-second period between information 
bits, which forces to zero the undesired term containing b+, in eq. (10). 
Analysis of this case is presented in Appendix A, and associated 
numerical results are given in Section 4.2. 

For now, we return to our normal transmission policy, where no off 
period is allowed between adjacent information bits. 

The objective of the detector is to compare the received sample in 
eq. (10) with a preset threshold in order to make a decision on the 
polarity of the data bit being detected, that is, b. 

Now the detector makes a wrong decision if £ is negative while bj = 
+1, or if — is positive and b> = —1. We note that during the detection 
interval of bj the other three data bits in eq. (10), namely, b2,, b%:, 
and b§, k ¥ 1, can independently take on {+1}. Hence, the conditional 
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probability of error is expressed by 


1p AT, = 

Pelh aewee = oP {0% ge bol + z) + n <0 | bp = + 
ay 
2 


sa nf als ea eqn 1}, 


where 
L 
x= De {oh 1Ai1(t) + Ry i(t:)}cos(y), 
=2 
zs Bi 1 4 
x2 = p T {62 Ryi (ti) — Rii(t))}cos(y), 
=2 
and 
= Vi k kp 
z= 2 T {b= Rei(te) + boR:1(7x)}cos(Oz). 
=2 


We can rewrite eq. (13) as 


AT AT 
Bi + (a + 2) 


— 


Pet py.x1,%,2 = ~ | erfe 


4 oV2 


AT AT 
Bi — oy ee + 2) 


oV2 


+ erfc 


where 


a eee 
erfc(u) = — { e dy 
au 


Tv 


(13) 


(14) 


(15) 


(16) 


(17) 


(18) 


and o is the standard deviation of the Gaussian noise. We notice that 
variables x; and x2 in eq. (17) are in a conjugate form and have 
identical even moments. This is because the data symbols have zero 
mean; therefore, all the odd moments of x; and x, are zero. As a result, 
all the cross-product terms in the derivation of the even moments 
become zero. It is then easy to see that the even moments of x, and x2 
are identical. Now, to evaluate the average error probability—as will 
be explained in the next section—we apply the moments of the 
interference terms along with the numerical Gauss quadrature inte- 
gration.’® It is not too difficult, then, to observe that instead of using 


eq. (17) we can use the following: 
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AT 
1 5: [Bi — (x + z)] 
= erfe ) —————————_ [,, (19) 
2 oV2 
where x in this equation can be either x, or x2. In other words, both 
eqs. (17) and (19) will result in the same average error probability 


under evaluation by moment-generating functions. 
Also, recalling 


Pe 6,22 = 








VNoT 
c= (20) 
2 
and observing the bit energy, 
A’T 
E, om 9 ’ (21) 


we can express eq. (19) as 


Peig.ne = . erfc | 2 [Bi — (x + aif (22) 


If instead of a sample value of a Rayleigh random variable in eq. (22) 
we had a constant value of do, then eq. (22) would become 


Pelz = ; erfc | \/2 [do — (x + aif (23) 


Now, in the absence of any multipath fading and multiple-access 
interference, dy) = 1, and this equation becomes 


1 Ee 
la = \/% 
5 erfc ( B) ; (24) 


which is the well-known” performance of a coherently demodulated 
BPSK signal. 

In the Rayleigh fading case the actual received signal-to-noise ratio 
is 


Ey 
= 2 
and its average is 
Ey 
— 2) 9 
yo = E{6i} No” (26) 
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where E{ -} denotes the expected value. Since (; is Rayleigh distributed, 
7 has an exponential density. Hence, 


Pejxz = f Peig,x2P (yd; (27) 
where 
1 
p(y) =— ew (28) 
Yo 


and 


1 {° | 
Pic = erfc Vy _ \/ 2 te +z) e dy, (29) 
270 Yo No 


This may be integrated by parts if we apply the following change of 


variable: 
2 EB, ; 
t= vy VB +2); 


and after some manipulations it results in 


Pass = ; {et - \/ 2 + | 


Ey 

— — (x + z)” 

sis PO ye pao 
Vy¥o + 1 yot 1 


V¥0 \/ 
-erfc - wei (x + 2) N, . (80) 


Interested readers are referred to Appendix B for a detailed derivation 
of eq. (30) (also see Ref. 21). We notice that in the absence of multiple- 
access interference and a single-path fading of the desired signal, eq. 
(30) becomes 


1 
pai fin 


v0 (31) 


V¥o + 1 


which is the ideal performance of a single-path Rayleigh fading chan- 
nel.”° 

Now using Gauss Quadrature integration’® the average probability 
of error can be obtained numerically, by averaging the conditional 
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probability of eq. (80) over the interference term, x + z. This is 
accomplished by first evaluating the N,, = 2N. + 1 moments of x + z, 
which are applied in evaluation of the N, weights and nodes of the 
Quadrature Rule.’® Hence, the average probability of error is 


Ey 
Wy erfc (- VE: 


Ey C 
V¥0 No”? Ey V0 
— ————— exp |- erfe | — § —- ——— _ !f, (32) 
Vyo + 1 Yo + 1 No Vy + 1 
where W;’s and £;s are the N, weights and nodes of the Quadrature 


Rule.!® A detailed formulation of this is given in Appendix C. By the 
same token, the average probability of error in eq. (23) becomes 


SW; ert | = (do ~ a (33) 


Ill. MOMENT-GENERATING APPROACH 


The average probability of error expression in eqs. (82) and (33) 
assumes 2N. + 1 moments of random variable (x + z), which is a 
function of independent random parameters: {;, t;, Tr, ¥1, Ox, and b*. 

Furthermore, x and z are independent and symmetrically distrib- 
uted; hence the odd moments of (x + z) are all zero. Therefore, having 
the even moments of x and z, one can determine the moments of 
(x + z) using the simple binomial rule, that is, 


E{(x + z)™} = y @ E{x'}-E{z™}, (34) 


i= 


P. = 


iIM2 


ly 
25 





iIM2 


1 
P.= ¥ 


In this section we derive the moments of z, and by similarity we 
deduce the moments of x. 


Since 

K 

o= > By, 

k=2 

where 
Ve k kp 
Ze >= T {621 Rea(re)-+ boRzi(7z)}cos ©,, 

then, 


7 : 
Elzk") = an ELV2")-E({cos @,)°"-[b4:Raa(ra) + BbRaa(ee)P"). (35) 
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To evaluate the second expectation of the right-hand side of eq. (35), 
since ©, is an independent random variable, we can deal with it 
separately. 

That is, we first can evaluate 


am 
ime 


and then find the second expectation of eq. (35) as 


fay 
H= ra E{{b®:Rea(tn) + b&Raa (ta)? 


ae 
m 1 k kp 2m 
ai rome mA [b2,Rei(te) + b6Re (te) "dre, 








where the expectation in H is over the random delay rz. 
Now we can expand the latter integral over N chips period. Hence, 





2m 
eae {Net re . 
H= Lab va a [be Rate) + b6Rea(72)]?" dre. 
n=0 nig 


We can then use the standard notations of Pursley’ to evaluate the 
correlation functions. This is accomplished by assuming rectangular 
chips and noting that for any 0 < nT, <7 S (n+ 1)T, S T, as shown 
by Pursley,’ 


ltt eae ens 


Rea(t) = An,,Te + Bny,(t — Te)’ ee! 
where 
An, = Cea(n — N) 
Bn, = Cyi(n +1—N) — Cia(n— N), 
An, =Ca(n) k=1,2,---,K 
Bry, = Cra(n + 1) — Cei(n) (37) 


where the discrete aperiodic cross-correlation term C,,(-) is related 
to chip sequences a’ and a; via 
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N-1-n 


y ga, OsnsN-1 
j=0 


N-1+n 


Cii(n) = > aa} -(N-1)sns0. (38) 
j=0 


gan 


0 else 


Therefore, H becomes 


H= 





1 m N-1 m 9 
-~——— ¥ E(5, 
T 4 n=0 r=0 ar 


(n+1)T, 
. a: [AmaTe + Bry(te — oT .)]” 


T, 


An, Te + Bay,(te — nT) PO" dre. (39) 
Notice that in deriving eq. (39), because of the even moments, b*, 
and bé are averaged to one. Now in eq. (39), by changing 7, — nT, to 


WT., H becomes 








po Tie Min) (2n) 


T 4” n=0 r=0 2r 


1 
a [A,,, + Bn, WI)" [An + 8, wpa (40) 
: 
Therefore, to determine H we have to solve 
1 
Pinrn a { [Ang a Bry, WI) -TAng, — By, WP" dW. (41) 
P 


This can be done recursively, using integration by parts, and the result 


is 
(7) 
B,,,) 1 L 
nae : OY BF aor G+) Cees) 
i+] 
(Ane tiBel (Ane Bale 
= (An)? An) (42) 


A detailed derivation of I,,,, is included in Appendix D. For H in eq. 
(40) we now have 
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H= 








T 4” n=0 2 2r 


and for E{z?”} in eq. (35), 


2 
Ete") = Cn) wD) ne m) Pare (48) 


We notice that for a Rayleigh distributed V;,,”” 


& 
Y dae m N-1 m r 








E{ V3") = 2". v(m), (44) 


where vp, = E { Vi/2} is the average strength of the Rayleigh faded path 
associated with the kth interfering user. Note that assuming equal 
average strength cochannel interferers under a fixed total interference 
power corresponds to a maximum average probability of error in digital 
communications. Therefore, the results are conservative with respect 
to this assumption. 
To find the moments of 
K 
z= > Zk» 
k=2 
we can use a three-step method prescribed in Ref. 12, where from the 
cumulants of random variable z;, Ym(zz), the moments of z are arrived 
at. This simple algorithm is outlined below. 
The first step is to find 


Mon = E{2z3"} 
and then 


m-1 


ween) = Mn+i (Zr) 4 X 2) Yr+1(Ze)-Mm-r(Zn) 


with 
yil2z) = Mi(zz) = 0. (45) 
The second step is to find 
K 
Ym(Z) = > Ym (Zz). (46) 
k=2 
The third step is to find 


—1 


Myst (Z) = Ym+i(z) + B e r+i(2)-Mn--(z) 
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with 
M,(z2) = 1 (2). (47) 


As stated earlier, we can use a similar method to find the moments 
of x. Recall that 


L 
x= ») Xi, 
I=2 
where 
n= EBL Rian) + bbRialrd}eosty) 
and that use of the technique given above to find the moments of z 


yields 
Cn) 
2m) — m ja2m 1 — G 2m 
Efxt™) = Ga EXBP"} Samet D Di ees (48) 








4 
where 
AB) A. 
aa : OG. G+ D 
(7) 
as {(An, A ae 
a —r) tit ') ma + Bn, 
t+1 
(An, a Bere a (Ana? Age (49) 
and 


E{pi"} = 2™pii- (m!), 


where po is the average strength of the lth path. Again, as stated 
earlier, equal partitioning of interferers’ strength for a fixed total 
cochannel interference power corresponds to a maximum average error 
probability. 

Having the moments of x; we can find the moments of x by a similar 
method, as described for z. Once the moments of x and z are available, 
we can use their independence property and solve for the moments of 
(x +2). 


IV. DISCUSSION OF NUMERICAL RESULTS 


In this section the average probability of error as a function of 
signal-to-noise ratio is evaluated for various channel configurations. 
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We will first discuss our spread-spectrum code-generation method, 
and then we will exhibit and discuss the behavior of the average 
probability of error. 


4.1 Code-generation method 


Pseudonoise (PN) sequence codes applied in our numerical evalua- 
tions are Gold sequences® obtained from multiplying two primitive 
polynomials, 


hi(x) = x’? +x° +1 (50) 
and 
ho(x) = x’ +22 4+ x7 +x41, (51) 


represented by octal numbers 211 and 217, respectively. Hence, the 
resulting sequence is 


h(x) =x44+x°4+ 24+ 284+ x2 + x4 +22 + x41, (52) 


represented by 41567, in octal notation. 

The number of shift register stages required to generate the codes 
from h,(x) and h2(x) is n = 7, and the codes period is N = 2” —1 = 
127. 

To find the actual codes, we used initial loadings of Ref. 23. These 
initial loadings are shown to generate a class of Gold codes known as 
Auto-Optimal with Least Sidelobe Energy (AO/LSE). In general, with 
a generator polynomial of the form 


h(x) = hox” + hyx™ + .-- thpix th, 
and for an initial loading of 
a= (Qo, Q1, +++ 5 An), 
we can use the following recursive formula to generate the codes: 
Ajtn = hyajtn-1 B -++ OB Np-1ajr1 OB hnay, J]20, (53) 


where ® stands for modulo-2 addition. Notice that in eq. (53) for 
simplicity we have represented the chips by a instead of a? as intro- 
duced in eq. (2). The generated codes have three-valued autocorrela- 
tion function sidelobes and a three-valued cross correlation taking on 
values from the set {+15, —1, —17}. 

In our numerical evaluation we used ten initial loadings.”? Hence, 
this covers generating ten periodic code sequences for a hypothetical 
community of users sharing the common channel band on a spread- 
spectrum multiple-access basis. Once the code sequences are obtained, 
we compute the partial correlation coefficients of eq. (37), which are 
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used in conjunction with eqs. (35) and (48) in finding the moments of 
x and z as described in Section 3.1. 


4.2 Numerical results 


In what follows we assume the signal communicated between the 
desired transmitter/receiver pair is received via up to ten distinguish- 
able paths, that is, L in eq. (4) is assumed to be deterministic and at 
most equal to ten. Also, unless otherwise specified, we assume the 
transmitters maintain some form of average power control so that the 
signals from different transmitters arrive at the receivers with the 
same average power. This kind of average power control in a wireless 
PBX application is not too difficult, because the users are connected 
via a star network. 

We consider two separate cases: 

Case 1—Suppose a terminal in the IWC environment can be moved 
slightly so that in the case of strong fading of the acquired path, 
another path, hopefully stronger, can be acquired and then the ter- 
minal remains stationary. In this case, if there is not much movement 
in the channel environment, one may assume {; is fixed and perhaps 
set 3; to some constant value, dy, and use eq. (33) to evaluate the error 
probability. 

Case 2—All the desired signals arriving via different paths at the 
receiver have Rayleigh distributed random gains. This is a scenario in 
which the transmitter terminals are mobile and multipath gains are 
Rayleigh with respect to geographic position of the terminals. There- 
fore we have to use eq. (32) to compute the average error probability. 

In all our computations 15 moments of (x + z) were found to be 
quite adequate in accurately computing the average error probability. 
All the average path gains between the desired transmitter and re- 
ceiver, fo,S, were assumed to be equal. This assumption also applies 
to the average path gain of the links between the K — 1 interfering 
transmitters and the receiver, o,’s. As stated earlier, this assumption 
will result in conservative average error probability values for a fixed 
total interference power. 

Figure 2 depicts the average error probability as a function of both 
average faded and unfaded signal-to-noise ratio corresponding to Case 
1. In the same figure, performance of an ideal coherent BPSK demod- 
ulator is shown. In Fig. 2 we observe two sets of results of eq. (33) 
corresponding to two different values of do. Note that all the interferers 
are Rayleigh faded with a hypothetical average strength of —14 dB. 
For do = 1, that is, when the desired signal is 14 dB stronger than each 
interfering signal, the solid curves in Fig. 2 exhibit the performance. 
The difference in the average strength can be provided by “capture.” 
That is, it can be due to a shorter distance or a higher transmitted 
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do =-14 dB 
£{B7}=-14 dB, i #1 
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Fig. 2—Error rate performance for a fixed acquired path gain and Rayleigh 
interferers. 


power. As can be observed, the multiple-access interference for at least 
up to 10 active users can be tolerated, and at an average error 
probability of 107’° only about 2-dB signal-to-noise ratio degradation 
is experienced relative to the ideal situation. Therefore the receiver 
offers an acceptable performance as long as it is operated with capture. 
Next we demonstrate the performance for when the desired signal is 
also 14-dB faded, as the interferers are. This is shown by the dashed 
curve in Fig. 2. As observed, the performance in this scenario is 
unacceptable. 
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_ Fig.3—Error rate performance for Rayleigh acquired path gain and Rayleigh 
interferers. 


We now consider Case 2, where the terminals are to be mobile and 
the receiver is to cope with the first received Rayleigh faded path it 
acquires. The average probability of error as a function of faded and 
unfaded signal-to-noise ratio is depicted in Fig. 3. Again the hypo- 
thetical average path strength on all the Rayleigh faded paths was 
taken to be —14 dB. As can be observed, a simple correlation receiver 
that is not equipped with any diversity means or error correction 
capability exhibits a poor performance for the Gold sequences adopted 
in this work. Needless to say, the ideal performance of such a receiver 
in the absence of any multiple-access interference—but with a single 
Rayleigh faded path—is poor to begin with, as depicted in Fig. 3. 
Comparing the dashed curve of Fig. 2 and the curve in Fig. 3 corre- 
sponding to the same parameters reveals that the performance in the 
latter case is much worse than the former because of the Rayleigh 
gain of the acquired path, as expected. In a multiple-access environ- 
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ment when a transmitter and a receiver are communicating, as soon 
as a second transmitted signal comes on the air, the Rayleigh faded 
path between the interfering transmitter and the desired receiver can 
be stronger than the one between this receiver and the desired trans- 
mitter. This creates a near-far situation owing to the Rayleigh fading 
channel model. We notice that the degradation between the case of 
having only 2 or 10 active users is insignificant in this case, since the 
initial jump in error probability is large with just two users. Such a 
large jump, as will be seen later, is due to the insufficient processing 
gain provided by the N = 127 period codes for a Rayleigh channel. To 
improve this situation, longer sequences and/or diversity means are 
desired. The aforementioned numerical results assume L = 10 fading 
paths of equal average strength between the desired transmitter and 
receiver. Evidently, the finite cross correlation among the codes, 
although small in magnitude, can cause cochannel interference limi- 
tation due to the Rayleigh fading nature of the environment. There- 
fore, as the thermal noise tends to zero, the average error probability 
saturates to an unacceptable value. To gain some insight into this 
problem, we consider the following example. 

Assume that we have a system of two users where there is a single 
Rayleigh faded path between the desired transmitter/receiver pair and 
that there is also a single Rayleigh faded path between the interfering 
transmitter and the desired receiver. A sample of the received signal 
after correlation and filtering is 


AT AT ‘3 
B= Bp bb + 7 [b21Roi(r) + b3Re1(r)Jcos(®) +, (54) 


2 
where 6 and v are Rayleigh gains of the desired and interfering paths, 
7 is the relative uniform delay experienced by the interferer, and 0 is 
the relative interferer path phase uniformly distributed over 0 and 27. 
Denote 


u = = [b2,Rei(r) + b3Re1(7)}cos(0). (55) 


mle 


From eq. (30) the average error probability conditioned on u and v is 
_ Be ay 
1 \/® Yo No 
Pejuv = =’) erfe | —uv — | — ——— exp |————_ 
2 No V¥0 + 1 Yo + 1 


-erfc pee LE aa , (56) 
Vyo + 1 No 
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where 


Ey 
— 2 — 
Yo E{p } No’ 
Let 
E, 
7 
Yeu: oe 
Then, 


Yo = Blu?) Blo) 
We assume that v and 6 have the same mean-square value, that is, 
Yo = Elv*} 

and 

0 = yok{u*}. (57) 
Denote 

e= E{u*}, 
and average the conditional error probability of eq. (56) with respect 
to v. That is, evaluate 


1 foe) 
P, aor i, P, upe ody, (58) 
Yo Yo 


This amounts to 


a VY - Vyo0(¥o + 1) 
2 V1 + Cv evo + Yo + 1 
Volvo + 1) ; - EYo | (59) 
2(e’vo + yo + 1) Ve'yo + eyo + Yo + 1 
Notice that when « = 0, that is, when there is no interferer, this is the 


standard Rayleigh faded channel performance. 
If we average eq. (59) with respect to 0 and 7z and let yo — ©, we get 


Pai= 


bol] = 


N-1 
a P.= 12N3 x [(Cio(n + 1 — N) + Ci2(n — N) 


+ Ci2(n +1- N)Ci2(n aml N) + Ci2(n + 1) 
+ Cie(n) + Cie(n + 1I)Ci2(n)], (60) 
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where C,2(-) represents the aperiodic cross correlation of eq. (38). 
The sum in eq. (60) has been approximated by Pursely.’” When we use 
his approximation the saturation level of the probability of error in 
the absence of thermal noise is 


. 1 

nee EN 
where N is the period of Gold sequences applied here. Therefore, as 
observed earlier, we can reduce the saturation level by increasing N. 
In Fig. 4 the average probability of error is depicted for the case in 
this example. As one can observe, the moment approach yields the 
same saturation level in the probability of error as predicted by eq. 
(61). 

Finally, as discussed earlier in Section 2.3, we consider the case of 
having an off period of a T-second period between the adjacent 
information intervals in order to avoid partial correlation interference 
from an adjacent bit. In terms of efficiency this is obviously equivalent 
to reducing the data rate by a factor of 2. The formulation for this 
case is in Appendix A, and the results for Case 2, where there is no 
unfaded path available between the transmitter and the receiver, is 
depicted in Fig. 5. As observed, the return in performance is negligible 
compared with the results in Fig. 3. To be more specific, the average 


(61) 
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Fig. 4—Error rate performance of two-users system example. 
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Fig. 5—Error rate performance without adjacent bit overlap. 


error probability at large signal-to-noise ratios improves almost by a 
factor of 2. This improvement can intuitively be predicted considering 
that the cause, that is, partial correlation interference, is only due to 
an adjacent bit overlap. 

Again the average path strength of all the Rayleigh faded paths in 
Figs. 4 and 5 was assumed to be —14 dB. 

Therefore, without some form of diversity or error-correction coding, 
the situation as described in Case 2 is hopeless. Of course, other 
multiple-access fading environments would suffer similar penalties in 
performance if the access orthogonality could not be maintained. For 
example, in Frequency-Division Multiple Access (FDMA), any spec- 
tral overlap caused by imperfect filtering of adjacent frequency slots 
can create a similar situation. The same can be said about Time- 
Division Multiple Access (TDMA), if burst modems used in this 
application introduce any interburst interference. Consequently, re- 
gardless of the mode of access, to overcome the Rayleigh fading in 
IWC applications a diversity of some form seems necessary. 


V. SUMMARY AND CONCLUSIONS 


Current work reported herein extends previous results’ ’® in the 
following respects. Analysis of the average error probability for Direct- 
Sequence Spread-Spectrum Multiple Access (DS-SSMA) is extended 
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to include the effects of multipath fading, typically experienced in an 
IWC environment. 

For spread-spectrum transmission the IWC environment may be 
modeled by a discrete number of resolved paths with each path having 
a Rayleigh distributed gain, a uniformly distributed phase, and a 
uniformly distributed delay that can vary from zero to one information 
symbol period. The latter assumption is made to ensure having a 
negligible amount of intersymbol interference. We assume a coherent 
receiver that uses no diversity information to detect the transmitted 
symbol. We use average probability of error in our performance 
evaluation. The method of moments is applied to multipath and 
multiple-access interference, and Gauss quadrature integration is used 
in the error probability evaluation. 

From our numerical work, exhibited in a sequence of graphs, we 
draw the following conclusions: 

1. If a non-Rayleigh faded path exists in an IWC environment, a 
simple receiver can operate with DS-SSMA in a capture mode with a 
graceful performance degradation caused by multiple-access interfer- 
ence. 

2. If all the discrete paths have Rayleigh gains and guaranteed low 
average error probability is expected at all times, the simple non- 
diversity coherent receiver considered in this work will not be able to 
cope with the Rayleigh channel fading with spread-spectrum codes of 
period N = 127. Therefore some form of diversity seems absolutely 
necessary. Otherwise, very long code sequences are needed to decrease 
the error probability of the interference-limited system. 

3. The results of this work indicate that in the absence of diversity 
even small amounts of multiple-access interference can be harmful in 
a Rayleigh fading IWC environment. Therefore, if a channelized access 
such as frequency-division or time-division multiple access is to be 
employed, then careful channelization, that is, tight filtering in the 
case of FDMA and isolated transmitted bursts in the TDMA case, is 
necessary to maintain a low probability of error, given that the 
synchronization problem of a channelized access in a multipath envi- 
ronment can be solved. 
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APPENDIX A 
Performance Evaluation in the Absence of Partial Correlation 


Consider the case of having T-seconds off periods between adjacent 
information bits. The decision variable of eq. (10) is changed to 
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L 
E= p; “ bo + = Py ™ Ri 1(t))cos(y) 
K V, 
Aly a bER.1(7,)cos(®,) + 7. (62) 
k=2 
So now 
Pipe 
c=) 7 Ria (ti)cos(y) (63) 
l=2 
and 
K V, “ 
= >» ap Rna(tx)cos(®,). (64) 
k=2 


To find the moments of x and z we follow a method similar to the 


one in Section III: 
a) 
(n+1)T, 


Ty 





R27 (t)dt,, (65) 





Etat) = aaa Vf 
where 
1 =F bbs (teos(y) (66) 


and 
Ria(t:) = An,,Te + Bn, ,(ti — nT). (67) 


N11 
After proper change of variables we can define 


1 


H = T2! oa! [A,,, + Bn, x]?"dx (68) 
1 A Qm+1 A 2m+1 
= Only Bae = {[An., + Bn] = [An,,] (69) 
nh 
Hence, 
Gy 
E{xi"} = 2 “PG, (1) np amera 2 | oe 
where 
1 me “ 
Lae Im +1. oe (An, a Baye ane [An, )°""7}; (70) 


from here on the problem is identical to the one solved in Section III. 
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APPENDIX B 
Integration of Conditional Error Probability 


Denote 
= —(x + z) VE»/No (71) 
and 
I=— | erfe(vy + To)e7/dy. (72) 
270 Yo 
Letting Vy + Io = t we have 
“At -T 
T=- 1 il —2(t — To) erfc(t)e~/70(¢-To)" de, (73) 
2 Yr, Yo 
Now we integrate by parts in eq. (73) to get 
I= : erfc(I'o) — eee { ° e7/r0)(t-To)" 9? dt. (74) 
yi va “lo 


Furthermore, 
‘oO 
& —T,)2 _— 
i) e7(/r0)(t-T 0)" 9? dt 
Tp 


= eTWrotl) { el rott/Vroel-Lo/V volvo DI ge (75) 
To 


and making the change of variable 


ee \ poe a Io 
Yo Volvo + 1) 


in the integral in eq. 75, we get 


va eT o/(vo+)) _V¥0_ erfc To ee Ue ‘ (76) 
2 Vyo +1 V yt 1 


Using this result in eq. (76) in the second term in eq. (74) gives 
v'¥0 
VYo + 1 


Tr V¥0 
“exp (- Wot :) erfc (x. Zs)} ; (77) 


which is the desired result in eq. (30). 
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APPENDIX C 
Formulation of Gauss Quadrature Rules From the Moments of (x + z) 
Denote the first N,, = 2N, + 1 moments of (x + z) by the sequence 
{un}, =0, 1, 2, --- , 2N,. In the problem at hand the random variables 
are evenly distributed. Therefore, as previously stated, the odd mo- 
ments are all zero. 
Let M = [m;;], i, 7 = 1, 2, --- 2N, + 1, be the Gram matrix of the 
system with 


Mij = Mitj-2- (78) 
Thus, 
ts S0e wa: ee Say 
0 He 0 . . e e 0 
ee re 
Me kk we bce b (79) 
WN, °° 5 + + + Ben, 


where pp = 1. Also, let M = R’R be the Cholesky decomposition of 
M, where T represents the transpose matrix with 


i-1 1/2 
ri = (ms = ») r :| (80) 


k=1 


i-1 
rij = (m, = z rata) | ra L <j. (81) 


Because all the odd moments are zero, it follows that r;; = 0 when 
(t + j) is odd. We now have an upper triangular matrix R = [r,;], 
i,j = 1,2, --- No +1. The matrix is used to calculate a set of numbers 
{6;},7=1, 2, ---, N., where 


and 


— Tj+1j+1 (82) 
: Vj 
Now we construct a tridiagonal matrix J as follows: 


0 6 O - - - #4O 


6 0 6b + + - 0 

0 b& O . .- .- 0 
ne 

: ONn,-1 

Bae 20 


1208 TECHNICAL JOURNAL, JULY-AUGUST 1985 


where J is an N, X N, matrix. 

By finding the eigenvalues and eigenvectors of the matrix J, it is 
possible to arrive at the weights and nodes of the quadrature rule. Let 
the eigenvalue equation be 


Iq; = iOje (84) 


Then the quadrature rule for the sequence (W,, ¢;) in eq. (82) is given 
by the set of numbers {qi;, \;}, 7 = 1, 2, --- Nc, where qi; is the square 
of the first element of the eigenvector g;, and }; is an eigenvalue in eq. 
(84). Hence, if the first 2N, + 1 moments are calculated, then the 
resulting quadrature rule will contain N, weights and nodes. 


APPENDIX D 
Evaluation of Even Moments 


Consider the integral 


= f (a + bx)”-(c + dx)”dx. (85) 


This integral can be expanded to 


I 2 E + b)"(c + aj"? _ at™e™ 
an d(m +1) d(m + 1) 


- 2 | (a + bx)""*-(c + dx)™**dx, (86) 


where we now have to solve for 


1 
i= { (a + bx)". (¢ + dx)™*"dx. (87) 
0 


If we integrate by parts in eq. (87) and substitute the result into eq. 
(86), we find that 
(j) 
1 l 
=) (-1)' sa 


i=0 i (i +1) (” +it ‘) 
L+1 
[(a + b)"-i(¢ + qymtitt — gn-igmtitty, (88) 
In the problem considered in the main body of the paper, 
a= Ap, b= B,, c=A,, and d=B,. 
Thus, Jo takes the form 
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(77) 
_ Ey gee (Bay NE 
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In this paper we extend previous work on isolated-word recognition based 
on hidden Markov models by replacing the discrete symbol representation of 
the speech signal with a continuous Gaussian mixture density. In this manner 
the inherent quantization error introduced by the discrete representation is 
essentially eliminated. The resulting recognizer was tested on a vocabulary of 
the ten digits across a wide range of talkers and test conditions and shown to 
have an error rate comparable to that of the best template recognizers and 
significantly lower than that of the discrete symbol hidden Markov model 
system. We discuss several issues involved in the training of the continuous 
density models and in the implementation of the recognizer. 


I. INTRODUCTION 


In the literature a wide variety of approaches have been proposed 
to recognize isolated words, based on standard statistical-pattern- 
recognition techniques.’ The most successful of these has been the 
template-based recognizer approach, which uses Dynamic Program- 
ming (DP) as the method for comparing patterns. Although the 
template-based approach using DP has been very successful, alterna- 
tive recognition strategies have been studied because of 
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1. The high computational cost of the DP approach; 

2. The difficulties in extending the DP recognition paradigm to 
more difficult problems—e.g., connected words, continuous speech; 

3. The desire to use a robust parametric model, rather than the 
nonparametric template, to represent the speech; 

4. The desire to use speech units other than words in some circum- 
stances—e.g., syllables, demisyllables, phonemes. 

For one or more of the above reasons, several different approaches 
have been proposed, such as using Vector Quantization (VQ) in the 
DP computation,* using word-based vector quantization to eliminate 
the DP processing,’ using VQ as a front-end preprocessor,’ and using 
Hidden Markov Models (HMMs) to represent the speech signal.°*"™ 
Although the VQ-based recognizers have performed very well in iso- 
lated-word recognition tasks, and have significantly reduced the com- 
putational costs, they have done very little to alleviate the difficulties 
in extending template-based approaches to large vocabulary connected 
and continuous speech recognition applications. As such, the HMM 
recognizer has been and will continue to be of great interest both 
because of its potential low cost, and because it is a parametric model 
of the speech signal that can model various events (phonemes, sylla- 
bles, etc.) in the speech signal. 

Although HMMs have been used in a wide variety of speech sys- 
tems,°*" our experience with their application to speech recognition 
systems has been considerably less than with that of template-based 
approaches. Hence each new experiment using HMMs gives us a 
better understanding of the strengths and weaknesses of such models 
as applied to different speech recognition tasks. In particular, in our 
own work, we have been studying how to apply HMMs in isolated- 
word, speaker-independent speech recognition applications over 
dialed-up telephone lines. In a previous investigation,® we studied 
HMMs based on observations consisting of discrete symbols from a 
finite alphabet (i.e., vector-quantized LPC vectors from a fixed-size 
code book). Work performed at IBM,’? CMU,? and more recently at 
Phillips’? has used continuous HMMs where it was assumed that all 
parameters of interest had Gaussian distributions. 

The HMMs to be discussed in this paper are based on continuous, 
mixture density models of the distribution of Linear Predictive Coef- 
ficient (LPC)-derived parameter vectors (e.g., cepstral vectors, log- 
area ratio vectors, etc.). We have devised training procedures for 
obtaining maximum-likelihood estimates of the parameters of the 
mixture distribution. We have applied the models to the problem of 
recognizing isolated digits. Our results show that the average error 
rates of such HMM recognizers are essentially identical to those of 
the best template approaches using DP methods, and considerably 
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lower than those of an HMM recognizer with a discrete symbol VQ 
front end. 

This paper is organized as follows. In Section II we present the 
continuous mixture density model. We show how we obtain the max- 
imum-likelihood estimates of the model parameters from a training 
set of data, and how the overall recognition system is implemented. In 
Section III we describe a series of experimental evolutions of the 
recognizer and present results on HMM systems with several different 
sets of parameters. In Section IV we discuss the results and relate 
them to earlier work with template-based approaches. We also discuss 
computational aspects of the system in this section. Finally, in Section 
V we summarize our results. 


Il. THE CONTINUOUS MIXTURE DENSITY HMM 


Figure 1 shows the type of HMM we are considering here. It is 
based upon a left-to-right Markov chain that starts in state 1 and ends 
in state N. The observed signal is assumed to be a stochastic function 
of the state sequence of the Markov chain. The state sequence itself 
is unobservable (hidden). The goal is to choose the parameters of the 
HMM to optimally match the observed characteristics of a given 
signal. 

The parameters that characterize the HMM of Fig. 1 are 

1. N, the number of states in the model. . 

2. A= [ay], 1<1,j < N, the state transition matrix, where a; is the 
probability of making a transition from state i to state 7. As shown in 
Fig. 1b, for left-to-right models we use the constraint a; = 0,j <i,j > 
i+ 2. 

3. B, the observation probability function. 

If we assume that the signal to be represented by the HMM consists 
of a sequence of observation vectors O = {O,, O2,--- , Or}, where each 
O; is a vector that characterizes the signal at time ¢t = 1, then we can 
consider two types of observation probability functions, namely, dis- 
crete and continuous. For the discrete type we replace O; by one of M 





a a a 
STATE 13 24 35 
DENSITY: 

by (x) by (x) b3(x) by (x) bg (x) 


Fig. 1—Representation of a left-to-right hidden Markov model with five states. 
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possible symbols (via some type of VQ) such that the distortion 
in quantizing O; is minimum. Let j be the state at time t. Then B = 
[be], 1 $j = N,1<k SM is the probability of observing symbol k, 
in state j. 

In the continuous case we have the probability density function 
B = {b;(x)}, 1 s j s N, where 0; (x)dx is the probability that the 
vector O, lies between x and x + dx. The types of density functions 
allowed for b;(x), for which a reestimation algorithm exists, include 
strictly log-concave densities,‘ elliptically symmetric densities, and, 
more recently, mixtures of strictly log-concave or elliptically symme- 
tric densities.’° In this paper we will consider Gaussian mixture 
densities of the form 


M 
bj(x) = 2 Cir (X, bye, Ujz), (1) 
aa 


where _/ (x, u, U) denotes a D-dimensional normal density function 
of mean vector u and covariance matrix U. 

To summarize the discussion above, a complete specification of a 
continuous mixture density HMM requires choosing values (and/or 
parameter estimates) for the following: 


N—number of states in the model 

M—number of mixtures 

D—number of dimensions in each vector 

A=[a,;]—state transition matrix 

C=[c;,]|—mixture gain matrix 

= [pjxa]—means of the mixture components 
U=[Uyzae]|—covariance matrices of the mixture components. 


For the work to be presented here, we have chosen N = 5 states on 
the basis of previous studies with discrete symbol models.® Also, our 
signal observation vectors (e.g., cepstral vectors, log-area ratios, etc.) 
are derived from the LPC vector of an eighth-order model of the 
speech signal. 


2.1 Training the HAM 


- For each word, v, in a vocabulary of V words ( V = 10 for the digits), 
an HMM is designed; i.e., the set of parameters above is estimated 
from a training set of data representing multiple occurrences of the 
vocabulary word by a wide range of talkers. Since a convergent 
reestimation procedure exists for the continuous mixture model,’® it 
is, in theory, possible to randomly choose initial values for each of the 
model parameters (subject to the stochastic constraints) and let the 
reestimation procedure determine the optimum (maximum-likelihood) 
values. However, experience with the reestimation procedure!” has 
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MODEL 
INITIALIZATION 


STATE SEQUENCE 
SEGMENTATION 


ESTIMATE PARAMETERS 

Ble} 

VIA SEGMENTAL 
K-MEANS 


MODEL 
REESTIMATION 


Fig. 2—The training procedure used to estimate parameter values for the optimal 
continuous mixture density fit to a finite number of observation sequences. 
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TRAINING MODEL 
PARAMETERS 


shown that the maximum-likelihood estimates of the means, p, are 
quite sensitive to the initial estimate. Hence a procedure for providing 
good initial estimates of u for each mixture and each state was required. 

Based on previous experience with a K-means iterative procedure 
for clustering data,'® a procedure for obtaining model parameter esti- 
mates was devised and is shown in Fig. 2. (The analysis used to give 
the LPC-derived vectors is reviewed in Section 2.2.) We assume a 
training set of data consisting of Q sequences of observations, where 
each sequence, O' = {0}, 03, --- , O7}, 1 <i <Q, is the set of vectors 
(observations) constituting a single occurrence of the word. The total 
observation vector is O = {0'O0? ..- O°}. The first step in the training 
procedure is to choose an initial model estimate. This initial estimate 
(unlike the one required for reestimation) can be chosen randomly, or 
on the basis of any good initial guess. (The procedure to be described 
here works well for a wide range of initial guesses.) 

We denote the N states in the HMM as q;, 1 s 1 Ss N. The second 
step in the training procedure is to segment each word occurrence, O', 
into states based on the current model, \. This segmentation is 
achieved by finding the optimum state sequence, via the Viterbi 
algorithm, and then backtracking along the optimal path. This pro- 
cedure is illustrated in Fig. 3, which shows a log-energy plot, an 
accumulated log-likelihood plot, and a state segmentation for one 
occurrence of the digit six. Figure 3 shows that the states correspond 
roughly to the sounds in the word six. 

The result of segmenting each of the Q training sequences is, for 
each of the N states, a set of the observations that occur within each 
state g; according to the current model. Since the assumed distribution 
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LOG ENERGY 


STATE 
w 


> Locf 
A: 
oa 
oO oO 
| 
| 


1 = 
1b, by bz b, b, = 49=T 
FRAME NUMBER 


Fig. 3—Plots of (a) log energy, (b) accumulated log likelihood, and (c) state assign- 
ment for one occurrence of the word six. 


of the observations, within the jth state, is b;(x), a comparison can be 
made of the marginal distributions b;(x) | x=...x,...; against a histogram 
of the actual observations (i.e., vectors assigned to that state). Such a 
comparison is given in Fig. 4 for a D = 9 dimensional representation 
with M = 5 mixtures. (The covariance matrices are assumed to be 
diagonal in this example.) The nine dimensions consist of the eight 
dimensions of a cepstral representation, and the normalized log energy 
as the ninth parameter. The results in Fig. 4 are for the first state of 
the digit 0. The need for values of M > 1 is seen in the histogram of 
the first parameter (the first cepstral component), which is inherently 
multimodal; similarly, the second, fourth, and eighth cepstral param- 
eters show the need for more than a single Gaussian to provide good 
fits. Many of the other parameters appear to be well fitted by a single 
Gaussian curve; however, in some cases even M = 5 mixtures do not 
provide a very good fit. 

Following the segmentation into states of all Q training sequences, 
a segmental K-means procedure is used to cluster the vectors in each 
state, g;, into a set of M clusters (to do this we use a Euclidean 
distortion metric and a VQ design algorithm). From the clustering, an 
updated set of model parameters is derived as follows: 


Gj = Number of vectors classified in cluster k of the jth state 
divided by the number of vectors in state j 

Ljka = dth component of the sample mean of the vectors classified 
in cluster k of state] 
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Fig. 4—Comparison of estimated density (jagged contour) and model density (smooth 
contour) for each of the nine components of the representation vector (eight cepstral 
components, one log-energy component) for state 1 of the digit zero. 


UO sters =(r, s)th component of the sample covariance matrix of the 
vectors classified in cluster k of state j. 


The state transition matrix coefficients, aj, are not changed according 
to this procedure. The new model, \= (A, B, zt LL, U), is obtained from 
the updated estimates B, ji, and U, and the original A matrix. At this 
point the formal reestimation procedure is used to reestimate optimal 
values (in a maximum-likelihood sense) of all model parameters. The 
resulting model is then compared to the previous model (by computing 
a distance score that reflects the statistical similarity of the HMMs”). 
If the model distance score exceeds a threshold, then the old model, ), 
is replaced by the new model, ) (the result of reestimation), and the 
overall training loop is repeated. If the model distance score falls below 
the threshold, then model convergence is assumed and the final model 
parameters are saved. 

As an alternative to using the sample means, jjza, and sample 
covariance matrix, Oyters (which are the maximum-likelihood estimates 
for a Gaussian distribution), we also investigated a method of fitting 
a single Gaussian distribution to an observed histogram within each 
cluster of each state. For the case when U is diagonal, a histogram 
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with NB bins is made for each component of the vector, and the model 
parameters (i.e., 4 and a) are chosen so as to minimize the cost function 
2 (hi — hi)? 
le = ie 
where h; is the observed frequency of occurrence in the ith bin, and h; 
is the corresponding model estimate for that bin. The minimization 
for the two-parameter case (i.e., «1 and o) can be trivially carried out 
by several different procedures. 

For the case when U is a full covariance matrix, the histogram- 
fitting procedure could, in principle, be extended to D-dimensional 
histograms with correlated components. However, the amount of 
training data available was insufficient for the number of parameters 
being fitted. Instead, the histogram-fitting procedure that we used was 
as follows. The sample covariance matrix, U, was estimated from the 
training data (as above), and decomposed as 


U=T' AT, 


where A is a diagonal matrix. The original vectors, c, were transformed 
by the relation w = Tc. In this manner the components of w were 
uncorrelated with diagonal correlation matrix A; hence the histogram- 
fitting procedure, described above, could be used along each trans- 
formed dimension separately. In practice we have found that the 
transformation to uncorrelated components and the Gaussian fitting 
gave somewhat better model parameter estimates than the sample 
estimates for the full covariance case. 

Since the steps of segmenting the training sequences into states and 
clustering the vectors via a VQ clustering procedure are relatively 
inexpensive (in a computational sense), and reestimation is an exceed- 
ingly costly procedure, a practical implementation of the training 
procedure of Fig. 2 is to bypass the step of model reestimation until 
local model convergence is obtained, and then apply the reestimation 
procedure at the final step. This procedure works well in practice, 
particularly when used for left-to-right models where the sequential 
characteristics of the process are of vital importance. 


2.2 The HMM recognizer 


Once the HMMs have been trained on each vocabulary word, the 
recognition strategy is straightforward. Figure 5 shows a block diagram 
of the recognizer. The speech signal, s(n), for the unknown word is 
first analyzed using an eighth-order LPC analysis. The speech sam- 
pling rate is 6.67 kHz, and overlapping sections of 45 ms of speech are 
analyzed every 15 ms to give a set of eight LPC coefficients. An LPC 
transformation algorithm is used to convert the LPC representation 
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Fig. 5—Block diagram of the HMM recognizer based on continuous mixture densities. 
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to the desired one for the recognizer. In particular we have considered 
the following possibilities: 

1. LPC-derived cepstrum of eighth-order (the zeroth-order term is 

not used) 

. LPC-derived log-area ratios 

. Autocorrelation coefficients normalized by energy 

. Residual normalized autocorrelation coefficients 

. Autocorrelation of LPC coefficients. 

The vector representation used in training is the one used in the 
recognizer. 

The next step in recognition is to find the optimum state sequence 
corresponding to the HMM for each vocabulary word, \”, 1 su Ss V, 
and compute the log-likelihood score for the optimal path. The decision 
rule assigns the unknown word to the vocabulary word whose model 
has the highest log-likelihood score. 

The optimum path is obtained by the well-known Viterbi algo- 
rithm,”” which can be compactly stated as: 

1. Initialization—6,(1) = log[b,(O,)] 

2. Recursion —6;(j) = max {6:-1(t) + log ay} + log[b;(O,)], 

j-2sisj 


Ot ® © bo 


2=t= T, 1ls<jsN 
3. Termination —log f = 67(N). 


2.3 Incorporation of duration into the recognizer 


Inherently, each state in the HMM has a geometric duration prob- 
ability. Thus, a state j, with a probability a; of returning to itself, has 
a state duration probability of 


p(Z) = (1 — ay)az, 


where / is the number of frames occurring in state 7. Experience has 
shown that exponentials are not good models for state duration prob- 
abilities. Thus, we have considered two alternative ways of incorpo- 
rating state duration information in the recognizer, namely, modifi- 
cation of the scoring procedure to include an internal duration model, 
and application of a post-processing duration model on the maximum- 
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likelihood state sequence as determined by the Viterbi algorithm. In 
either case, in the training phase, we estimate a state duration prob- 
ability of the form 


p;(7/T) = probability of being in state j for exactly (2/T) of the 
word, where JT is the number of frames in the word and 
Z is the number of frames spent in state /. 


The quantity //T, which ranges from 0 to 1, is the normalized duration 
within a given state. For each word, and for each state, the quantity 
p;(4/T) is estimated (via a simple counting procedure on the training 
sequences) for 25 values of 7/T from 0 to 1. 

The state duration probability, [p;(Z) or p;(Z/T)], is not estimated 
as part of the training procedure, but instead is computed directly 
from the training sequences based on the models obtained from the 
training procedure. Hence the estimates of p;(7/T) are strictly heuristic 
ones, not maximum-likelihood estimates. Unfortunately, direct rees- 
timation of the maximum-likelihood estimate of p;(//T) is, at present, 
totally impractical both because of the excessive computation required, 
and because of the sparsity of training data for estimating the in- 
creased number of model parameters. 

A typical set of histograms of p;(7/T) for a five-state model for the 
word six is shown in Fig. 6. Although the states are hidden, examina- 
tion of the results of segmentation of typical utterances (of the word 
six) into states shows that the first two states are essentially the initial 
/s/, the third state is a transition to the vowel /i/, the fourth state is 
the vowel, and the fifth state is the stop and the final fricative /s/. As 
seen from Fig. 6, the average duration of the first state is generally 
very brief; the second and third states have somewhat longer average 
durations; the fourth state has a well-defined peak in the density with 
an average duration of about 20 percent of the word; the final state 
- (the stop plus the fricative) has an average duration of about 50 
percent of the word. 

For scoring a given observation sequence using the internal duration 
model, the recursion step of the Viterbi procedure is modified to the 
form 


6:(j) = max max {nti + log aj 


j-2si<sj-1 0<JTS1.0 
o 
+ a log pj(4T) + » oat, (0,--I (2) 
T=1 


Note that in eq. (2) the duration term appears only when the state 
changes. Furthermore, a multiplier factor a on the log-duration prob- 
ability is used to adjust the importance of the duration part of the 
scoring. 
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_ Fig. 6—Histograms of the normalized duration density for the five states of the digit 
six. 


The implementation of the recursion of eq. (2) is considerably more 
costly than the implementation of the standard Viterbi recursion, 
since the values 6,-(i) must be retained for a large range of 7 values, 
and since the > log[b;(O,-,)] computation must be repeatedly done 
during each iteration. In practice we have measured an increase of 15 
to 20 times in computation for the internal duration model over the 
standard Viterbi algorithm. For these reasons we have also considered 
a much simpler post-processor duration model in which the original 
Viterbi alignment is performed, the maximum-likelihood state se- 
quence is determined, and the duration of each state is obtained via a 
backtracking procedure. The post-processor then increments the log- 
likelihood score by the log-duration probabilities (suitably weighted 
again) to give: 


N 
log f = log f + a Y loglp;(4/T)], (3) 
j=l 
where 4/T is the normalized time spent in state j along the optimal 
alignment path. 


The incremental cost of the computation for the post-processor 
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duration model is essentially negligible, and we will see in Section III 
that it works as well or better than the internal duration model 
discussed above. 


ill. EXPERIMENTAL EVALUATION 


To evaluate the performance of the HMM recognizer with the 
mixture density representation, a series of experiments were run in 
which several parameters of the models were varied. All evaluations 
were performed on a database of isolated digits recorded over standard 
dialed-up telephone lines. Four sets of spoken digits were used. These 
consisted of the following: 


DIG 1—100 talkers (50 male, 50 female), one replication of each 
digit by each talker.?! These recordings have been used as a training 
set in a wide variety of evaluations of isolated-word recognizers at 
AT&T Bell Laboratories. The nominal bandwidth of these recordings 
was 100 to 3200 Hz. 

DIG 2—Same 100 talkers and recording conditions as DIG 1; re- 
cordings made several weeks later than those of DIG 1. 

DIG 3—100 new talkers (50 male, 50 female), one averaged occur- 
rence of each digit by each talker obtained from averaging a pair of 
robust tokens of the digit.2%?? The transmission conditions (i.e., analog 
front end, filter cutoff frequencies, etc.) differed slightly from those 
used in recording the DIG 1 and DIG 2 databases. 

DIG 4—A second group of 100 new talkers (50 male, 50 female), 20 
recordings of each digit by each talker.” A random sampling of one of 
the recordings of each digit by each talker was used. The transmission 
conditions differed substantially from those used in recording the 
other databases. The nominal bandwidth of these recordings was 200 
to 3200 Hz. 


Thus, each of the four sets of digits contained 1000 digits. For training 
the models, only the digits in set DIG 1 or set DIG 4 were used; for 
testing and evaluating the performance of the recognizer, each of the 
four sets of digits was used. 


3.1 Pilot recognition experiments to determine representation 


A series of pilot experiments was run to determine a good set of 
parameters for use in the HMM recognizer. The five parameter sets 
(transformations of the LPC parameters) mentioned in Section 2.2 
were studied. Results indicated that the best performance was obtained 
from the cepstral parameters; however, almost the same performance 
was obtained from the log-area ratio parameters. The remaining three 
parameter sets—i.e., energy-normalized autocorrelations, residual- 
normalized autocorrelations, and autocorrelation of LPC coeffi- 
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cients—all gave significantly poorer performance. This was due to the 
use of a Euclidean distance metric in the clustering part of the training 
procedure. Each of the poor representations had the property that one 
or more of the coefficients in the vector contributed significantly more 
variance in the distance calculation than the remaining coefficients; 
hence a large sensitivity to the details of the training set resulted, and 
very poor estimates of the means and covariances of the parameters 
were obtained. Such problems could have been alleviated by replacing 
the Euclidean distance metric with a covariance weighted metric; 
however, we did not do this because of the greatly increased compu- 
tational burden. 

As a result of the pilot experiment, the parameter set chosen was 
an eighth-order cepstral vector with the option of appending a peak- 
normalized log energy as a ninth component of the vector. 


3.2 Diagonal versus full covariance matrices 


Two forms for the U matrices of eq. (1) were considered, namely, 
diagonal matrices (with assumed zero correlation between components 
of the representation), and full covariance matrices. The advantage of 
the diagonal covariance matrix is that the computation of b;(x) reduces 
to asimple sum of products of Gaussians, whereas for a full covariance 
matrix the computation of 5;(x) requires a matrix multiply. The 
disadvantage of the diagonal covariance matrix representation is that, 
in general, for correlated vector components, a larger value of M (the 
number of mixtures) is needed to give an adequate model than for a 
full covariance matrix representation. Neither representation has any 
particular advantage in terms of ease of making initial estimates or 
ease of reestimation. 

A series of recognition tests was run with diagonal covariance 
matrices using M = 1, 3, and 5, and full covariance matrices using 
M = 1 only. The results showed that performance with the full covar- 
iance matrix with M = 1 was better than that obtained using only the 
diagonal covariance matrix with M = 1 and 3, and comparable to the 
performance with M = 5. Hence, in all subsequent recognition tests 
we will consider both diagonal and full covariance models. 


3.3 Applicability of word clustering to model generation 


In the model training procedure all 100 tokens of each word were 
used to derive a single HMM for the word. In earlier work, with 
template-based approaches,” it was shown how word clustering tech- 
niques could be used to design a set of templates to represent a broad 
population of talkers. Thus, one question of interest was whether the 
word clustering procedure could be combined with the model genera- 
tion technique to give more than one HMM per word with better 
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performance than the single HMM system. This idea was tested as 
follows. First, a single HMM per word was created on training set 
DIG 1; next the two-cluster-per-word template set was used to parti- 
tion the 100-token training set into two groups. For each group a 
single HMM was created; hence a total of two HMMs per word was 
used in the performance evaluation. The potential disadvantage of 
this procedure should be clear, namely, that the training data per 
model available for estimating HMM parameters is half that used for 
the single model case. Hence there is a good possibility of obtaining 
less reliable estimates of the model parameters. 

This procedure could be continued as above for three or more 
template solutions; however, experience indicated that a two-model 
solution was about the limit for 100 training tokens. Beyond this point 
the unreliability of the estimates was the dominant factor. 

Results of a formal series of experiments with each of the four test 
sets and with one and two models per word are given in Table I, which 
shows average digit error rates for both diagonal covariance models 
(part a), and full covariance models (part b) using normalized log- 
energy and cepstral coefficients, and with the post-processor duration 
model. For the diagonal covariance models, the results show that 
for the reference set (DIG 1), the average error rate was essentially 0 
for both one- and two-model-per-word systems. For the test set DIG 2 
the average error rate for the two-model-per-word system was slightly 
smaller (by 0.4 percent) than for the one-model-per-word system. For 
the test sets DIG 3 and DIG 4, the average error rate for the two- 
model-per-word system was 1 percent smaller than for the one-model- 
per-word system. Overall, averaged across the three independent test 
sets, the two-model-per-word system had a 0.7 percent smaller error 
rate than the one-template-per-word system. This difference, although 
small, is significant at the 95-percent level for a test with 4000 digits. 


Table I—Comparison of performance of HMM recognizer 
with one and two models per word* 
mie ; 
Number of Average Digit Error Rate (%) 


Models per Test Set 
Word DIG 1 DIG 2 DIG 3 DIG 4 Average 


(a) Diagonal covariance models, M = 5 mixtures 


1 0.2 1.1 3.9 5.2 3.4 
2 0.1 0.7 2.8 4.2 2.67 
(b) Full covariance models, M = 1 mixture 
1 0.2 0.9 2.9 4.7 2.83 
2 0.2 0.6 2.2 4,7 2.5 
* Both energy and duration were used in the evaluation. The training set 
was DIG 1. 
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For the full covariance models, the improvement in performance in 
going from one to two models per digit was smaller than that obtained 
in the diagonal covariance case. On average, the improvement in error 
rate was only 0.33 percent, and for two of the four sets (the training 
set DIG 1, and the testing set DIG 4) there was no improvement in 
performance with two models per digit. Thus, for the full covariance 
models a single model per word was adequate for the data. 


3.4 Effects of different number of mixtures 


Using the two-model-per-word system for the diagonal covariance 
case, the number of mixtures, M, was varied from 1 to 7, in steps of 
two, to see the effects on recognition performance. The results of these 
tests on the four-digit databases are given in Table IIa. The results 
show an improvement in performance from an average test set digit 
error rate of 3.57 percent for M = 1 down to an average test set digit 
error rate of 2.57 percent for M = 5; results for M = 7 show a slight 
increase in average test set digit error rate to 2.97 percent. This 
increase in error rate for the M = 7 case is primarily due to a 0.8- 
percent increase in error rate for test set DIG 4. This result seems to 
indicate that no real improvement in modeling the statistics of the 
observations is obtained with M = 7; instead, a somewhat broader 
range for fitting incorrect words is achieved, thereby raising the error 
rate on DIG 4. Based on the results of Table II, a value of M = 5 was 
deemed most appropriate for the recognizer. 

Another test was run for the diagonal covariance case, in which the 
value of M was made variable with each state of the model. The chosen 
value was based on the average distortion in the initial modeling 
section of the training loop. Thus, large values of M (on the order of 


Table !I—Comparison of performance of HMM 
recognizer with different values of M* 


Average Digit Error Rate (%) 


Test Set 
M DIG 1 DIG 2 DIG 3 DIG 4 Average 
(a) Results on diagonal covariance models for two models per digit 
1 1.1 1.3 3.2 6.2 3.57 
3 0.2 1.1 4.1 5.2 3.47 
5 0.1 0.7 2.8 4.2 2.57 
7 0.0 0.8 3.1 5.0 2.97 
(b) Results on full covariance models for one model per digit 
1 0.2 0.9 2.9 4.7 2.83 
2 0.0 1.2 6.0 5.3 4.17 
* Both energy and duration were used in the evaluation. Training set 


was DIG 1. 
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10 to 15) were required for some states, whereas very small values of 
M (1 to 2) were required for others. Recognition tests using the variable 
M models gave very poor results (i.e., error rates considerably higher 
than those for fixed M models). An analysis of the errors showed a 
greater increase in the likelihood for incorrect models than for the 
correct model. Hence we concluded that variable M models were not 
a viable alternative for HMM recognizers. 

Using the one-model-per-word system for the full covariance case, 
a similar test was performed in which values of 1 and 2 were used for 
M. The results, listed in Table IIb, show a degradation in performance 
from an average test set digit error rate of 2.83 percent for M = 1 to 
an average test set digit error rate of 4.17 percent for M = 2. Most of 
the increase in error rate occurs for test set DIG 3, where the error 
rate increases by 3.1 percent. This result again indicates that a single 
full covariance matrix provides an adequate fit to the training data, 
and that increases in M primarily decrease the amount of training 
data per model and therefore lead to poorer parameter estimates and 
worse recognition performance. 


3.5 Effects of energy and duration 


To study the effects of including energy in the signal representation, 
and of including the duration model in the testing, a series of recog- 
nition runs were made with the M = 5, diagonal covariance, two- 
model-per-word system, and the M = 1, full covariance, two-model- 
per-word system. The results of these recognition tests are given in 
Table III. The duration model was implemented as a post-processor 
computation in all cases. 


Table !II—Comparison of performance of HMM recognizer with and 
without energy and with and without duration model 
(training set was DIG 1) 


Average Digit Error Rate (%) 


Test Set 
Condition DIG1 DIG2 DIG3 DIG4 Average 
(a) Results on diagonal covariance models, M = 5, with two models per digit 
No energy, no duration model 0.3 2.5 4.3 8.0 4,93 
Energy, no duration model 0.3 0.9 2.5 5.5 2.97 
No energy, duration model 0.1 1.3 3.3 5.4 3.33 
Energy, duration model 0.1 0.7 2.8 4.2 2.57 
(b) Results on full covariance models, M = 1, with two models per digit 
No energy, no duration model 0.2 2.0 2.8 7.0 3.93 
Energy, no duration model 0.2 1.2 2.1 4.4 2.57 
No energy, duration model 0.3 1.3 3.0 5.9 3.4 
Energy, duration model . 0.2 0.6 2:2 4.7 2.5 
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The results show clearly that the addition of either energy or 
duration uniformly improves the performance of the HMM recognizer, 
although energy is more important than duration for the full covari- 
ance models. Furthermore, the combination of both energy and dura- 
tion model yields better performance than either factor individually. 
The biggest improvements in performance were obtained for test sets 
DIG 3 and DIG 4, where the transmission characteristics of the speech 
were different from those of DIG 1 and DIG 2. In these cases the 
addition of energy and duration model makes the system more robust 
because these features are, for the most part, insensitive to differences 
in transmission conditions. 


3.6 Comparison of internal duration model and post-processor duration 
model 


The next set of experiments compared the two different implemen- 
tations of the duration model, namely, the internal duration model 
and the post-processor duration model. In both cases the same 
(suboptimal) state-duration probability density function was used, 
with a multiplier of a = 3.0. (This factor was optimized based on 
preliminary experimentation.) The results of the two runs are given 
in Table IV. ) 

The results show that the performance of the HMM recognizer with 
the post-processor duration model was uniformly slightly better than 
for the recognizer with the internal duration model. Across the three 
test sets the improvement in performance was about 0.9 percent, and 
for the two data sets with different transmission conditions (DIG 3 
and DIG 4), the improvement was 0.9 percent and 1.7 percent, respec- 
tively. In addition, the computational load was between one and two 
orders of magnitude lower for the post-processor duration model than 
for the internal duration model. Hence the results given in Table IV 
strongly justify the use of the “suboptimal” post-processor duration 
model as an alternative to using the inherent exponential distributions 
for each state. The major problem with the use of any duration model 
is the difficulty of making reliable estimates of the density function 


Table 1\V—Comparison of performance of HMM recognizer with two 
types of duration models* 


Average Digit Error Rate (%) 


Test Set 
Duration Model DIG 1 DIG2 DIG3 DIG4 _ = Average 
Internal in Viterbi search 0.2 0.9 3.7 5.9 3.5 
Post-processor 0.1 0.7 2.8 4,2 2.57 


* Results given on diagonal covariance models, M = 5, with two models per digit, and 
with energy used as a feature. Training set was DIG 1. 
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from a finite training set of observations (as is invariably the case for 
most speech processing applications). However, the improvements in 
performance obtained from using the duration model more than justify 
its use in the HMM recognizer. 


3.7 Effects of different training sets 


The last set of experiments investigated the effects of different 
training sets on the overall recognizer performance. The results of 
these experiments are given in Table V, which shows average test set 
digit error rate as a function of training set, model type, and number 
of models per digit. (All models used both energy and the post- 
processor duration model.) The results of Table V show that when the 
set DIG 4 was used as the training set, the performance among all 
four test sets was more uniform than when the set DIG 1 was used as 
the training set. The results also show that with a single model per 
digit, the performance with the DIG 4 training models was comparable 
or better than the performance of the DIG 1 training models with two 
models per digit. Thus the use of the slightly narrower bandwidth 
training data led to models that were more robust to small differences 
in recording conditions than those obtained from the broader band- 
width training data. 


IV. DISCUSSION 
4.1 General results 


In the previous sections we have proposed and tested an HMM 
isolated-word recognizer that uses a continuous mixture density model 
for the probability densities of the feature vector. Based on experi- 
mentation with the recognizer, in a speaker-independent mode, using 
a vocabulary of ten digits, the following general results were obtained: 

1. The proposed model training procedure, with an iterative K- 
means loop for estimating initial values for the means and covariances 


Table V—Comparison of performance of the recognizer as a 
function of the training set, model type, and 
number of models per digit* 


Number Average Digit Error Rate (%) 
a el 
Training Models Test Set 
Set Model Type per Digit DIG1 DIG2 DIG3 DIG4 Average 
DIG 1 ‘Full covariance 2 0.2 0.6 2.2 4.7 2.5 
DIG 4 _ Diagonal covariance 1 2.5 2.4 2.8 0.5 2.57 
DIG 4 _ Full covariance 1 2.5 1.7 2.1 0.8 2.1 
DIG 4 Full covariance 2 2.3 2.2 2.1 0.5 2.2 


ot 


* All models used both energy and the post-processor duration model. 


1228 TECHNICAL JOURNAL, JULY-AUGUST 1985 


of the components of the mixture model, works extremely well in 
practice and was able to converge to a local maximum of the likelihood 
function in a small number of iterations (typically 2 to 4 in most 
cases). 

2. Several speech parameters (most notably the set of cepstral 
parameters and the set of log-area ratios) are well represented by the 
continuous mixture density, and give good recognition performance in 
the HMM recognizer. Other speech parameters (e.g., energy-normal- 
ized autocorrelation parameters, residual-normalized autocorrelation 
parameters, etc.) are not well represented by the continuous mixture 
density, and give relatively poor performance in the HMM recognizer 
when a Euclidean distance measure was used. 

3. Mixture models with diagonal covariance matrices need a some- 
what larger number of mixtures (typically, M = 3 to 5) than mixture 
models with full covariance matrices (M = 1) in order to give the same 
performance. 

4, Combining the techniques of clustering and HMM model building 
can lead to small improvements in the performance of the HMM 
recognizer. 

5. The addition of a word-normalized energy contour (as an extra 
dimension to the feature vector) uniformly improves performance of 
the HMM recognizer and makes it more robust to differences in talker 
populations and transmission conditions. 

6. The addition of duration information, on a state-by-state basis, 
into the HMM recognizer uniformly improves performance and in- 
creases robustness of the recognizer to different talkers and transmis- 
sion conditions. 

7. The combination of normalized energy and duration information 
works better than either factor alone in the HMM recognizer. 

8. Word models with variable number of mixture densities per state 
(based on clustering distortion statistics) yield significantly worse 
performance than models with a fixed number of mixture densities 
per state. 

9. The duration model of the HMM recognizer can be conveniently 
(and suboptimally) implemented as a post-processor to the Viterbi 
decoding procedure. In practice the performance of this system is 
actually better (somewhat) than the recognizer with the duration 
model built into the Viterbi decoding procedure. 

The above results are one measure of the success achieved by the 
continuous mixture density HMM recognizer. Another way of mea- 
suring the success is to compare the current performance results with 
those of alternative recognition systems based on templates” and 
based on discrete densities (i.e., VQ symbols). Such a comparison is 
given in Table VI for the case when the data of set DIG 1 was used as 
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Table VI—Comparison of performance of three types of recognizers 
on the digits database* 


Average Digit Error Rate (%) 


Test Set 
Type of Recognizer DIG1 DIG2 DIG3 DIG4 = Average 
HMM—Continuous density 0.1 0.7 2.8 4.2 2.57 
HMM-— Discrete density — 2.9 — — — 
DTW—Templates 0.0 0.6 2.7 3.9 2.4 


* Training set was DIG 1. 


the training set. For the discrete density HMM recognizer, results are 
given only for the DIG 2 data set where the performance is significantly 
worse than that of the HMM recognizer with a continuous mixture 
density. For the template-based Dynamic Time Warping (DTW) 
recognizer, the results, based on the latest clustering procedure,” are 
comparable to those of the continuous density HMM recognizer. Since 
the template-based DTW recognizer has been studied for about ten 
years and has been highly optimized in its performance, the equality 
between the HMM recognizer and the DTW recognizer, at least for 
the digits vocabulary, is highly significant. 


4.2 Computational considerations 


To calculate the computation required in the HMM recognizer, and 
to contrast it with that required by the DTW recognizer, we must 
define the following: 

N = Number of states in HMM model 
Number of mixture densities per state 
Dimensionality of vectors in signal representation 
Average number of frames (observations) per word 
Number of HMMs per word 
Number of words in vocabulary 
Number of templates per word in DTW system 
= Order of LPC analysis. 

The computation for the HMM recognizer, in the Viterbi decoding 
algorithm, is 


VOruWMHOe 
Il 


Cy=R-V-N-T-C), 


where C, is the computation required to evaluate the mixture density 
b;(x). Assuming that multiplications, divisions, exponentiation, and 
logarithms all take a single multiply-add time (somewhat optimistic 
calculations), then 


C, = 3DM «, + (diagonal covariance) 
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or 
C, = D?M «, + (full covariance) 


and 
Cy = 3D-M-R-V-N-T «, + (diagonal covariance) 
C, = D?M-R-.V-N-T  «, + (full covariance). 


The standard DTW recognizer requires 


or 


T? 
Corw = Q-V- 3 -(pti1)*,+. 


Hence the ratio of Cy to Cprw is 


Cy 3DMRN 





RATIO = "eee = Sot (diagonal covariance) 
Q 3 (p + 1) 
or 
Ratio = ee (full covariance). 
Q rm (p + 1) 


If we choose typical values of N = 5, M = 5, D=9, T = 40, R = 2, 
V=10, Q = 12, p = 8, for the diagonal covariance case, we get 


RATIO = 15/16, 


and using R = 1, M = 1 for the full covariance case we get 
RATIO = 9/32, 


i.e., the computation of the HMM recognizer is essentially that of the 
DTW recognizer for the diagonal covariance case, and about one- 
quarter that of the DTW recognizer for the full covariance case. This 
situation is very different from the discrete symbol HMM recognizer, 
where the computation was an order of magnitude smaller than that 
of the DTW recognizer. The problem with the continuous mixture 
density recognizer is the b;(x) computation, which is extremely expen- 
sive, especially for values of M> 1. 


Vv. SUMMARY 


In this paper we have extended our experimental investigations of 
HMM recognizers to include the case where the density function for 
the observations in each state is represented by a continuous mixture 
of Gaussian variables. We have shown how the parameters of such a 
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signal representation can be optimally estimated (in a maximum- 
likelihood sense) from a finite training set of data, and have given a 
simple way of implementing the training procedure based on a K- 
means iteration. We have tested the HMM recognizer, in a speaker- 
independent mode, on a vocabulary of the ten digits, and shown that 
the error rate of the system is smaller than that obtained from the 
discrete density (VQ-based) HMM recognizer, and comparable to that 
of the multiple template-based DTW recognizer. 
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Maximum-Likelihood Estimation for Mixture 
Multivariate Stochastic Observations of 
Markov Chains 
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In this paper we discuss parameter estimation by means of the reestimation 
algorithm for a class of multivariate mixture density functions of Markov 
chains. The scope of the original reestimation algorithm is expanded and the 
previous assumptions of log concavity or ellipsoidal symmetry are obviated, 
thereby enhancing the modeling capability of the technique. Reestimation 
formulas in terms of the well-known forward-backward inductive procedure 
are also derived. 


I. INTRODUCTION 


Hidden Markov models, which use probabilistic functions of Markov 
chains to model random processes, have been found to be extremely 
useful for stock market behavior, ecology,’? and more recently, speech 
recognition.*”° The effectiveness of this model class lies in its ability 
to deal with nonstationarity that often appears in the observed data 
sequences. The general structure of such a class of models may be 
briefly described as follows. 

Consider a first-order N-state Markov chain governed by an N xX N 
transition probability matrix A = [a,], and an initial probability vector 
wu’ = [uyU2 --- uy]. Obviously, YX, aj = 1 for any i = 1, 2, ---, N, 
and »} Ny u; = 1. a; is the probability of making a transition from state 
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i to state j given that the current state is 1. For any integer state 
sequence © = (Oo, 61, 92, -- - Or), where 0, € {1, 2, --- , N}, the probability 
of © being generated by the Markov source can be easily calculated by 


Pr(O| A, u) = Ug,86 0, °° * Aer O7- (1) 
Now, suppose 9 cannot be directly observed. Instead, we assume that 
what we observe is a stochastic process S = (sj, Se, --- , Sr), produced 
by an underlying (stochastic) state sequence (6), 02, --- Or). Each 


state, say i, manifests itself through a probability density function 
fils), f fi(s)ds = 1. We use F = {f;(-)} to denote such a set of density 
functions. The probability density of S = S 4 (s), so, --- , sr), givena 
specific state sequence 9 generated by the Markov chain with transi- 
tion probability matrix A, and initial probability vector u is thus 


T 
f(S|0, A, u, F) = II fo,(s:). (2) 
t=1 
It then follows that the density of S, given A, u and F, is 
T 
f(S|A, u, F) = 2 Ug Il Ap,_,0,f0,(St)- (3) 
allO t=1 


The triple (A, u, F) & 2 is called a (hidden) Markov model source and 
we write f(S|A) 4 f(S|A, u, F), for simplicity. 

Given an observation sequence S, the objective in maximum-likeli- 
hood estimation is thus to maximize f(S|A) over all parameters in ). 
Such a maximization problem is clearly nontrivial. To solve this 
problem, a reestimation algorithm—developed by Baum et al.’ in 
1970—that guarantees monotonic increase in the likelihood as the 
algorithm iterates, is often used. An auxiliary function, based upon 
the Kullback-Leibler number,® serves as the basis of Baum’s optimi- 
zation procedure, in which parameter estimates are characterized as 
the critical point of the auxiliary function. However, the development 
in Ref. 1 encounters difficulties when the densities {f,(-)} are not log 
concave. The Cauchy density f(s) = a ‘(1 + s*)~', which is only 
concave for 3s” < 1, was cited as one such problematic example. 

More than a decade later, in an effort to obviate the log-concavity 
limitation in Baum’s algorithm, Liporace’ invoked a representation 
theorem by Fan® to redefine the auxiliary function and then success- 
fully extended the reestimation algorithm to accommodate a class of 
elliptically symmetric, multivariate distributions. As a result, each 
fi(s) € F is allowed to assume the form 


| Ri [77h g:(s)), (4) 
where g;(s) is positive definite quadratic, 


&i(s) = (s — m)*Ri(s — ni). 
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The asterisk denotes the transpose of a vector or matrix as we, 
following Liporace, will be dealing with vector observations from this 
point on. The matrix R; is positive definite and symmetric, and the 
location vector 7; is an arbitrary point in the observation space that is 
d-dimensional] Euclidean. 

While Liporace’s results are significant in expanding the scope of 
the reestimation algorithm, the requirements that the observation 
densities be elliptically symmetric are in many real situations still very 
restrictive. In particular, useful parametrizations of speech signals, 
such as reflection coefficients and autocorrelation, have been shown 
by Gray and Markel? and Rabiner et al.,!° respectively, not to exhibit 
the desired symmetry. This lack of symmetry is often observed even 
within each state because of the arbitrariness in choosing the number 
of states for modeling the given process. It is thus the purpose of this 
paper to further obviate the ellipsoidal symmetry assumption so that 
an even more versatile statistical modeling technique than the previous 
ones is obtainable. Levinson also reported the same effort."! 

The class of densities F = {f;(-)} we consider in this paper is the 
class of mixtures of general, strictly log-concave, and/or elliptically 
symmetric densities, having the form 

M 


fis) = ¥ cindie(s), (5) 


where 5;,(s) is general strictly log concave and/or elliptically symme- 
tric and c;, satisfies 


M 
Yd cr = 1 for 1 ee aag iN: (6) 
k=1 


As required in Liporace’s results,’ one extra assumption for elliptically 
symmetric b;(s) is necessary: the density b,,(s) also satisfies the 
consistency conditions of Kolmogorov (see Ref. 12, p. 10) so that 
b;.(-) has the representation 


biz(s) = f M8; nik, VRin,)AG(v) (7) 


for some probability distribution G on [0, ©). In (7), the expression 
As; niz, vRiz) is the multivariate Gaussian density with mean vector 
nix and covariance matrix vR;,. Clearly, {f;(s)}, as expressed in (5), is 
very general and may serve better in the modeling of many complex 
but realistic observations than unimodal, symmetric density functions. 

This paper is organized as follows. The main body of the theory is 
presented in Section II, where the auxiliary function is redefined and 
the reestimation formula for all the parameters is derived. In Section 
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III, applications of the theory to familiar probability densities are 
discussed. Furthermore, for computational convenience, parameter 
reestimation using the forward-backward inductive procedure is also 
provided. 


Il. REESTIMATION 
2.1 Joint density 


For mathematical clarity, the following definitions are necessary. 

Let A be an open subset of Euclidean p space. A hidden Markov 
model ) is a point in A and to each d € A we have a smooth assignment 
4 —> (A(A), u(A), F(A)). One trivial assignment is that dimensions in 
A are one-to-one, corresponding to the parameters defining the triple 
(A, u, F), and thus p is the total number of model parameters. 

Define the state alphabet Q, A {1, 2, ---, N}. Let 7" be the 
(T + 1)th Cartesian product of Q,. The state sequence space is de- 
noted by 71, and © € 07*! means © = (60, 61, --- , Or), where every 
6,€E Q,. 

We further define the branch alphabet , 4 {1, 2, --- , M}. Similarly, 
QF is the set of all T-tuples K = (kj, ko, --- , kr), where every k; € Qy. 
K is called a branch sequence. 

The global density function of (3) with state density defined by (5) 
can be written as 


T M 
f(S|r) = >» At.) Il Eee > cnbra(s) | (8) 


alloca/+ t=1 


The summand in (8) over all © € Q7*! is, in fact, the joint density 
f(S, ©|), which can be expressed as 


T M 
f(S, @| A) = Us I] Q9,_ 16, p> Co,n00,n(S¢) 
t=1 =1 


M M M T 
= >, } ee > lu, II cn ubna(s) 
ky=1 kg=1 kp=1 t=1 
*Co,k, COgko °° * COpkp- (9) 
We further define 
T 
f(S, 0, KA) = Us, Il A1p,_,6,00,4,( St) Co,k, (10) 
t=1 
Therefore, the joint density of the truncated stochastic process S is 
AS|A)Y= yd FS, 0, KA). (11) 
eco! Keof 
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An interpretation of (11) is that there are N7*! possible stochastic 
state sequences that may lead to the observation S, with each possible 
state sequence being a superposition of M7 branch layers. 


2.2 Auxiliary function and an inequality 


For more general theoretical interest, let Q = 07+! x 07 be a totally 
finite measure space with measure y(0, K). The joint density of (11) 
then has the following general form: 


f(S|X) - | #6, 0, K|A)du(0, K). 


Following the concept of the Kullback-Leibler statistic, we define an 
auxiliary function Q(X, \’) of two model points, \ and )’, in A, given 
an observation S: 


Q(A, A’) 4S { ns, 0, K|A)log f(S, 0, K|X’)du(O, K). (12) 


We now have the following theorem: 

Theorem 1: If Q(A, A’) = QQ, A) then f(S| A’) = f(S| A). The inequality 
is strict unless f(S, ©, K|A) = f(S, 0, K[X’) almost everywhere 
du(0, K). 

Proof: Similar to Baum et al., log x is strictly concave for x > 0. 
Hence, 


f(S| A’) f(S, 0, KX’) 





ESI) ode FSD 
KS, 0, KIN f(S, @, KIN’) 
Nee Je Figpny 46 @ KIN) 


ee ae 
f(S1X) f(S, 0, K|)) 


= [A(SITTQ@Q, X’) — QQ, AI] = 0 


du(0, K) 


by hypothesis. The inequality above is due to Jensen’s inequality for 
the measure d{(0, K|A) = f(S, 0, K|A)du(0, K)/f(S|A). This ine- 
quality is strict unless f(S, 0, K|’)/f(S, 0, K|A) is constant almost 
everywhere dé(0, K|X), hence unless f(S, 0, K|\’) = f(S, 0, K|A) 
almost everywhere dy (0, K). 

The significance of Theorem 1 will be discussed below. For simplic- 
ity, we often use the expression of (11) for the joint density and define 
the auxiliary function as 
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QA, A”) = 2 f(S, 0, K|A)log f(S, 0, K]X’) (18) 


ecott! Keof 
as long as the key result of Theorem 1 is held valid. 


2.3 Reestimation algorithm 


Theorem 1 is one of the bases of Baum’s reestimation algorithm 
that is sketched below for self-containedness. For a given observation 
S, the reestimation algorithm starts with an initial guess of the model 
\. The parameter reestimates are then defined to be those that 
maximize Q(X, \’) as a function of )’; that is, the model reesti- 
mate nN stemming from the current model \ is \ = -7(A) € {X E Al 
Q(A, 4) = maxyea Q(A, d’)}. The transformation 7A — A is called 
the reestimation transformation. If Q has a unique global maximum 
as a function of X’, the set {\} has only one element \. Then X plays 
the role of \ as before and new reestimates are determined. The 
procedure iterates until some criterion is met. 

Due to Theorem 1 and the following theorem, the above iterative 
procedure produces a sequence of reestimates that guarantee mono- 
tonic increase in the likelihood f(S |) unless it reaches a critical point 
of the likelihood. 

Theorem 2: Let f(S, 90, K|X) be continuously differentiable in d for 
almost all (©, K) € Q. Let 7 be a continuous map of A — A such that 
for each fixed \, X = (Xd) is a critical point of Q(X, \’) as a function of 
d’. Then all fixed points of the reestimation transformation 7 are 
critical points of f(S |), and if f(S|X) > f(S| A), unless X = J, all limit 
points of F"(Xo) & F( F(T +++ (F(Xo))) «+ +) are fixed points of F 
for any \o € A. 

Proof: Let V, be the gradient vector. 


Vaf(S|A) |x, = Va {ns. 0, K|A)du(O, K) 


7 if f(S0, KIA, log f(S, @ K1d)Idu(@, K) 


Vi QA, (A, d’) N=A+ 
Thus V,f(S|A)],-x = 0 if and only if Vy QQ, A’)|y=, = 0 at 
\ = X. The rest of the proof follows Baum et al.! 
Theorems 1 and 2 thus guarantee that after each iteration, the new 
reestimate \ improves the likelihood, i.e., f(S|) > f(S|A), unless X 
is a fixed point of the transformation. On the other hand, the trans- 
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formation will converge to a fixed point, or equivalently, a critical 
point of the likelihood, if an increase in the likelihood is maintained 
after each iteration and if the limit lim,_,. 7"(Xo) exists, regardless 
of what the initial guess \o(€ A) is. If f(S| A) has finitely many critical 
points, 7"(Xo) approaches a critical point of f(S|) that is at least a 
local maximum. 

The transformation as previously defined requires maximization of 
the auxiliary function. Difficulties encountered in the maximization 
process would directly translate into difficulties in obtaining the 
maximum-likelihood estimate. We next show that if every 5;(-), 
t=1,2,---,Nandj=1,2,--- , Mis strictly log concave or elliptically 
symmetric with the representation (7), Q(A, ’) has a unique global 
maximum as a function of )\’, and thus the transformation exists and 
is single valued. The reestimation algorithm is thus guaranteed to 
work for the joint density of (8). 


2.4 Maximization of the auxiliary function 


The auxiliary function for the joint density (8) with mixture densi- 
ties is defined in (12). The following decomposition can be easily seen: 


log f(S, 0, KA’) 
T T T 
= log uj, + ¥ log as,_.o, + DY log b54,(s:) + x log cj,x,- (14) 
t=1 t=1 t= 


The next theorem suggests that maximization of the likelihood by way 
of reestimation can be accomplished on individual parameter sets due 
to the separability shown in (14), if the following assumptions hold. 
Suppose for almost all (@, K), log f(S, ©, K|\) = ¥41 log f(S, 6, 
K|A,), where for each i and almost all (0, K) log f“(S, 0, K|);) has a 
unique global maximum as a function of );. Note that \ = {),} and q 
is the number of parameter sets after separation. Define Q;(\, Aj) by 


Qi(A, AZ) = J ns, @, K|A)log f(S, ©, K|A:)du(0, K). (15) 


Then for d fixed, Q;(A, 4/) as a function of \/ has a unique global 
maximum , that is a critical point of Q;(\, d/). The reestimation 
transformation 7 is thus defined as 7:\ — d = {),}. We further 
define JF: — hj = {A Aq +++ Ny oo Ag}: 

Theorem 3: Under the above assumptions, for all } € A, and every 
i, f(S| G(A)) = f(S|A) with equality if and only if i; is a critical point 
of f(S| \) with respect to d; or, equivalently, \; is a fixed point of Fand 
furthermore, f(S| 7(A)) = f(S|A) with equality if and only if d is a 
critical point of f(S| A) or, equivalently, a fixed point of 7. 
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Proof: 


Q(, 8) = FBO) + BO, Nd 


2 > QQ, dj) = Q(A, r), 


so Theorem 1 implies f(S|;) = f(S|A). Since Q,(A, AJ) has a unique 
global maximum as a function of )/, the inequality Q(A, A,;) = Q(A, A) 
is strict unless A; = \;. Furthermore, 


QO, X) = ¥ QO, K) > Y QW») = QA, W), 


the second half of the theorem is thus true. 

The separation of (14) is seen to be the key to the increased 
versatility of the reestimation algorithm in accommodating mixture- 
observation densities. Let b;, be the parameter set defining the density 
bj.(s). Obviously, if bj,(s) is multivariate Gaussian, bj, = (nj, Ryx), 
where 7;, is the mean vector and R;, is the covariance matrix. We now 
write the auxiliary function in a separated form using the simplified 
expression of (13) without loss of generality. 


Q(A, A’) x 2 FS, 0, K|A)log f(S, 0, K| A’) 


T 

Yd f(S, 0, K]A) {is ug, + ¥ log ao,_,6, 
@ K t=1 

T 8 

+ ¥ log bj2,(s:) + d log cn 
t=1 t=1 
N 

= Q.(A, a’) + 2X Qa,[, {aj} ] 


N M N 
- 2; >, Q0, bj) + 2X Q.1A, {of al, (16) 


j=l k=1 


where 


Qu(A, w’) = x 2 f(S, 0, K|A)log us, 


M2 


> F(S, 60 = i, K| A)log uf (17) 
K 


o. 
ll 


1 
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T 
Qa[r, faf}Ma] = x LAS, 0, K|)) x log a4,_,0,5(8:-1 — i) 


T 
= » 2 2X f(S, 6-1 = 1, 6: = Jj, K|A)log af (18) 
Q(A, bj) = x 2 f(S, 0, K| r) 
T 
-y log b6,,(8¢)6(0¢ — j)d(k_ — R) 


~~ 
It 


1 


f(S, O, = a ky == k| A)log bje(St), (19) 


M3 


1 


o> 
ll 


and 


T 
Q4(A, ohlta) = DD f(S, © KIA) ¥ log cfa,5(0e — j) 


Mz ° 
Ms 


f(S, 0 =J, ke = RI A)log chy. (20) 


> 
ll 


1 t=1 


The above expression 6(-) is the Kronecker delta function. 
Individual maximization of Q,, Qa, and Q., for i = 1, 2, --- , N subject 
to the constraints 


N 
i. y uj = 1, u; 2 0 
=I 


Se 


N 
2. ¥aj=1, a;20 for all appropriate i andj 
jal 
M 
3: > él, cj = 0, (21) 
j=l 


respectively, is well known.’*"* These individual auxiliary functions 
have the same form ©, wjlog y;, which as a function of {y;}%,, subject 
to the constraints )\%, y; = 1 and y; = 0, attains a global maximum at 
the single point 
yap 7=1,2,--,N. (22) 
ps Wj 
i=l 


The result has been proved in many ways.“ 

When 6;;(s) is strictly log concave in bj, and limjy,,)+« log bj(s) = 
—oo, it is easily seen that for » fixed Q,(A, bj) has a unique global 
maximum that is a critical point of Q,(A, bj,). When b;,(s) is elliptically 
symmetric, the following theorem due to Liporace’ is applicable. 


Theorem 4: If (i) bjx(s) has the representation of eq. (7), and (ii) there 
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are among 5), S2, +++ , Sr, d + 1 observations, any d (the dimension of 
each observation vector) of which are linearly independent, for fixed i, 
Qs(A, bj) has a unique global maximum as a function of bh, = (njrs 
R},), and this maximum is the one and only critical point of Q,(A, bj:). 

Proof of this theorem is easily obtained by following the Appendix 
in Ref. 7. 

The reestimation algorithm has thus been extended to accommodate 
the hidden Markov joint density (8) with mixture observation densi- 
ties. 


Il. APPLICATIONS 


We now explicitly derive the reestimation transformation. By ap- 
plying eq. (22), we can easily calculate W, A, and {cj}, for i = 1, 2, 
--., n, the reestimates that for fixed \} maximize Q,(A, wu’), 
Qa(r, {af}M1) and Q.,(A, {ch}#1) fori = 1, 2, --- , N, as a function of 
u’, {af}%, and {c/,}M41, respectively. 


1. Initial probability: 

Q.(A, wu’) 4 : 2 f(S, 6 = 1, K|A)log uj. 
Hence, fori = 1, 2, ---, N, 

Ui, = ag! (S Oo = 1, Kin / 4S K|)) 


= f(S, 60 = t] A)/f(S| A). (23) 


2. Transition probability: 
For everyi=1,2,---,N 
2 


’ 
T 


M2 


Qa, (A, {aj}ei) = 


f(S, 6-1 = i, & = j, K| d)log aj. 


~ 

tl 
_ 
oS 
roy 


Therefore, on j=1,2,- 
5 VFS, O-1 = 1, 0: = J, =i.Kin / 


t=1 K 
T 


d f(S, O-1 = i, K]) 
K 


=1 


ry 
Ca 


T T 
= yi F(S, 6-1 = 1, 06 =J|A) x A(S, -1 = t]A). (24) 


3. Branch probability: 
For every i= 1, 2,--- , N, 
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M T 
Qe(A, {cin} #1) =2 d F(S, 6 = i, Ry = Rl A)log cj. 
=1 t=1 
Then obviously, fori =1,2,---,N,k=1,2,---,M, 
T he 
c=) f(S,64=1,k:=R|A)/ YD f(S,6=t]dr). (25) 
t=1 t=1 


4, Branch density: 
For everyi=1,2,---,N,andk=1,2,---,M, 


i 
Q.(A, bi.) = 2 f(S, 6 = 1, Rk: = R| A)log bi(s:). 


Maximization of Q,(A, bj) with respect to bj, is well known for many 
familiar density functions. The solution to the maximization problem 
is, in general, obtained through differentiation; i.e., we find b,, that 
satisfies 


Vv, Qa(A, bis) |, jae 


Vv; 0in(S:) 


T 
= > f(S, 6 = l, k, > k| r) bh,(s;) bi, — bi 


= 0, (26) 


For strictly log concave b;,(s), the solution can be easily found. For 
elliptically symmetric b;,(s), 


ba(s) = |Riel7hin(gin(s)), 
where 
Bin(s) = (s — niz)*Ri'(s — nix), 


with representation (7), Liporace’s results apply.’ In particular, the 
solution to (26), i.e., reestimates 7, and R,,, is given by 


f(S, 6, = 1, ky = RI A)-s 
in = —-—_-_— (27) 
> f(S, 6: = 1, ky = RIA) 


t=1 


its 


T 
_ x f(S, 6, = 1, ky = R|X)-(S: — nin) (Se — nin)* 
R, = ——-.-- (28) 
y f(S, 6; = 1, k, = k|X) 


bee | 
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Note that if bz(s) is multivariate Gaussian, the reestimates 7;, and 


R;, above are readily applicable. It is also easy to see that each R,, is 
positive definite. If s is any d-dimensional vector, 


T 
s*Ri. s = Y xin(t)[s*(s: — nx) = 0, (29) 


t=1 


where 
T 
xin(t) = f(S, 0 = i,k, = k|dA) / DY F(S,& =i, ke = RIA) 
t=1 


= 0. 


The inequality (29) is strict provided for any n;, the vectors {s, — nix} 
span the d-dimensional observation space, i.e., the observation process 
S = (81, So, --- , Sr) satisfies the condition laid out in Theorem 4. 


The above reestimates can be conveniently calculated with the 
forward-backward inductive procedure. Define “forward probabilities” 
ao(t) = uj, l= Ly 2; eee , N, and 


a(t) = f(si, Sa, *°° y St, 6: = t| A) 
N 
= > ar-1(J )ajfi(s:), (30) 


fori=1,2,---, Nandt =1, 2, --- , T. Branch densities f,(s) are 
defined in (5). Similarly, define “backward probabilities” 


Bi(t) = f(Se41, Sera, -- + 5 Sr | =U, A) 
N 
= » Bra] aif; (St41), (31) 


and 67(i) = 1, fori = 1, 2, ---, Nandt= T-1, T—-2,.---, 0. 
Further define “branch probability” y,(i, k) 


yet, k) = f(s, S2, °°° » Sty A, = l, k, = k|X) 


N 
= DY aij) ajicindin(s:), (32) 
j=l 
fori=1,2,---,N,k=1,2,---,M,andt=1,2,.--, T. Then, 
f(S, 6, = | \) =, a(t) B,(t), (33) 
which leads to 
N 
f(S|A) = 2X ar(t)B,(2), (34) 
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and in particular at t = T, 


N 
f(S|d) = ¥ ar(i). (35) 
Furthermore, 
f(S, Oy-1 = l, 6, = Jj|A) 
= ae-1 (1) asf; (Se) Bi(J) 
M 
= a-i(t)ay Py cxbals) BJ), (36) 
and 


f(S, 0: = 1, kk = RA) 
= (1, k)B,(7) 


N 
-_ > ar-1(J )ajiCinbdin(S:) 8, (1). (37) 


As a result, the reestimates are expressed in terms of the forward and 
backward probabilities: 


1. Initial Probability: 


N 
iW; = ao(t)Bo(t) > ao(j)Bo(J) 


N 
= a(t) Bolt) » ar(7) (38) 


2. Transition probability: 


T M 
DY aer(i)ay > enbalse) BJ) 
Gy = o  —  _ = _ (39) 
Py aty-1(1) By-1(7) 
3. Branch probability: 
T N 
») > ae-1(J ) ajiCindin(S) Be(2) 
= (40) 
2 a(t) B:(t) 


4, Branch density: 
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Ms 


LD Oe-1( J )Gjicindin(se) Be(Z) + Se 
CT eek ak (41) 


T 
> >» Or—-1(J )ajiCindin (Se) Be (1) 


t=1 j=1 


on 
ll 


and 


M3 
‘Mz 


Or-1(J ) GjiCindin( Se) Bi(1) «(Se — nin) (Se — nin)* 


ol 
oy 

ll 

=r 

~ 

Ul 

_ 


(42) 


ik = T ON 
Py » ae-1(j ) ajiCindin ($2) Be (1) 
== 
Note that the results of (41) and (42) apply to the case of mixture of 
elliptically symmetric densities with the representation (7). Mixtures 
of multivariate Gaussian densities, of course, fall into such a category. 
For other strictly log-concave densities, (26) applies. 

Note that the above results can be easily applied to conventional 
parametric estimation of mixture distributions by setting the number 
of states, N, to unity. 


IV. CONCLUSIONS 


We have extended the reestimation algorithm to accommodate a 
broad class of mixtures of strictly log-concave or elliptically symmetric 
multivariate distributions. The algorithm is particularly useful in 
modeling nonstationary stochastic processes with multimodal non- 
symmetric probabilistic functions of Markov chains that could not be 
dealt with previously. Explicit reestimates in terms of the well-known 
forward-backward inductive probabilities are derived for computa- 
tional ease. Due to the greatly expanded capability of the reestimation 
method, more accurate modeling of sophisticated signals and thus 
improvements in various applications such as speech recognition are 
expected. 
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Model Representations 
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Many signals can be modeled as probabilistic functions of Markov chains 
in which the observed signal is a random vector whose probability density 
function (pdf) depends on the current state of an underlying Markov chain. 
Such models are called Hidden Markov Models (HMMs) and are useful 
representations for speech signals in terms of some convenient observations 
(e.g., cepstral coefficients or pseudolog area ratios). One method of estimating 
parameters of HMMs is the well-known Baum-Welch reestimation method. 
For continuous pdf’s, the method was known to work only for elliptically 
symmetric densities. We have recently shown that the method can be gener- 
alized to handle mixtures of elliptically symmetric pdf’s. Any continuous pdf 
can be approximated to any desired accuracy by such mixtures, in particular, 
by mixtures of multivariate Gaussian pdf’s. To effectively make use of this 
method of parameter estimation, it is necessary to understand how it is affected 
by the amount of training data available, the number of states in the Markov 
chain, the dimensionality of the signal, etc. To study these issues, Markov 
chains and random vector generators were simulated to generate training 
sequences from “toy” models. The model parameters were estimated from 
these training sequences and compared to the “true” parameters by means of 
an appropriate distance measure. The results of several such experiments 
show the strong sensitivity of the method to some (but not all) of the model 
parameters. A procedure for getting good initial parameter estimates is, 
therefore, of considerable importance. 
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I. INFRODUCTION 


The theory of signal representation based on Hidden Markov 
Models (HMMs) is well established and has been applied to text 
analysis,' coding theory,” ecology,® and most recently, speech process- 
ing. © The form of the HMM that we are considering is sketched in 
Fig. 1. The Markov chain has N states, and transitions between states 
are governed by a stochastic transition matrix, A, with elements a,;, 
where 


a; = probability of making a transition to state j, 
given currently in state 1. 


In a given state, j, the observed output of the model is a random vector 
with a probability density function (pdf) 5;. 

Given the model of Fig. 1, it is necessary to be able to estimate the 
model parameters (i.e., the transition matrix, A, and the pdf’s b;) from 
training data consisting of observations of output sequences generated 
by the model. One very useful method of parameter estimation for 
HMMs is the Baum-Welch reestimation procedure.’ For the case of 
continuous pdf’s of interest here, the method was originally shown to 
be valid for log-concave densities.’ This restriction was relaxed by 
Liporace,® who extended the applicability of the method to elliptically 
symmetric densities. However, this class of densities is still too restric- 
tive for many interesting problems (e.g., measured densities of various 





®NN 


Fig. 1—Markov model with N states. 
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speech parameters). Therefore, we consider a more general represen- 
tation of the density—a finite mixture of the form 


M 
b; (x) o Y CimV [X, Hm, Un) j= 1, N, (1) 
m=1 


where Y may be any log-concave or elliptically symmetric density, 
and is assumed to be Gaussian in our present study. The vector x is 
the observation vector. The vector jm and the matrix Ujm are, respec- 
tively, the mean vector and the covariance matrix for the mth mixture 
component in state j. The coefficients, cjm, are the mixture gains, and 
satisfy the stochastic constraint 


M 
> ea Cin = 0, (2) 
m=1 

so that 


i b;(x)dx = 1, J=1,N. (3) 


The representation of eq. (1) can be used to approximate arbitrarily 
closely any finite, continuous density function; hence its appropriate- 
ness to a wide range of problems. It has recently been shown’ that the 
reestimation procedure of Refs. 7 and 8 can be extended to cover the 
mixture representation of eq. (1). 

To understand the properties of such HMMs, and to study the 
sensitivity of the parameter estimates to the details of the estimation 
procedure, we have simulated several “toy” models and examined the 
effects of sample size, initial parameter estimates, model inconsisten- 
cies, etc., on the corresponding estimated models. In this paper we 
present the results of our simulations. Since we have studied only a 
few, carefully selected cases, we make no claims about specific sample 
sizes, range of initial parameter values, etc. Instead, it is intended that 
the examples presented allow the reader to understand the nature of 
the representation, and thereby use it appropriately for his or her 
particular application. 

The outline of this paper is as follows. In Section I we show how a 
toy model or HMM sequence generator can be implemented to provide 
appropriate training sequences for estimating model parameters. In 
Section II we review the continuous density HMM. In Section III we 
describe a series of experiments designed to study the sensitivities and 
properties of the HMM signal representation. Finally, in Section IV 
we review the key results and discuss their implications for practical 
problems. 
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I]. REVIEW OF THE CONTINUOUS HMM 


Consider an N-state Markov chain where we label the states qi, qo, 

-, Qn. The Markov chain is characterized by its state transition 
matrix A = [a,]. Each state q; is characterized by a continuous 
multivariate, probability density function b;(x), where x is a K- 
dimensional observation vector. 

Given a sequence of observations, O = O,, Oz, --- , Or, where each 
O, is a K-dimensional vector, we can calculate the likelihood of O, 
given a model M. We denote the likelihood as “(O|M). Following 
Baum,’° we can define a set of forward and backward likelihoods, a;(i) 
and 8,(j) respectively, where, for 1 <1,7< N,and1<st<T, 


a(t) = L(O,, Oo, ---, O; and q; at time t|M) (4) 
and 
B.(J) = FP (O41 On42 eee Or|q; at time t and M). (5) 


Baum has shown that a;(i) and 8;(j7) can be computed recursively. 
Assuming that we start in qi, whereby ao(1) = 1, ap (i) = 0,2 S 
is WN, and Br(J) = 1,15) <N, then for1 < ts T we get 


N 
ar(j) = b waa (00, (6) 
and for T-—12>t=0, 
N 
B, (1) = 2X aij; (Or+1) Bei (J). (7) 


Thus, #(O|M) can be efficiently evaluated as 
N 


N 
£(O|M) = » ae (1) a5; (Or+1) Brrr (J), (8) 


i=1 j= 


for 0 s t s T — 1. The parameters of the HMM are estimated by 
finding some M that is a local maximum of #(O|M) for a given 
observation sequence O. 

Using the mixture density of eq. (1) as the parameterized pdf’s 
b;(x), the model M is specified by the following: 


number of states in the model 

number of mixture densities for each distribution 

number of dimensions of each observation vector 

[a,;;] = state transition matrix 

= [cjm], where cj = mixture gains for mth mixture in state j 

= [pjmr], where jim, = the kth component of the mean vector pjm 
for mth mixture in state j 


FaAPRES 
il 
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U = [Ujmu], where Ujmz = the (Rk, !)th entry of the covariance matrix 
for the mth mixture in state j. 


Given the values chosen for N, M, and K, and a set of initial guesses 
for A, C, p, and U, a set of reestimation formulas is available® for 
optimizing “(O|™M), for a given training set of observations O. 

There are two general cases of the model that are of interest, namely 
the ergodic case in which the Markov chain is ergodic (i.e., all states 
are aperiodic and recurrent nonnull) and the left-to-right case in which 
a transition from state gq; to state q; is possible only if j = i (i.e., there 
is a sequential progression through states of the model). Both general 
cases are of interest for real-world applications. 


2.1 Toy Markov model generator 


In order to investigate the behavior of the parameter estimation 
algorithms for the continuous HMM, a toy Markov model generator 
was implemented. Its function was to generate an observation sequence 
(for the ergodic case), or a set of observation sequences (for the left- 
to-right case), for an input model specification M;,. Each observation 
generated by the model was a K-dimensional vector according to the 
probability density b;(x) for the jth state. 

The algorithm used to generate the observation sequences is the 
following: 

1. Set the state index, j = 1 and the time index, t = 1. 

2. Partition the unit interval proportionally to cm, 1< ms M. 
Generate x, a random number uniform on [0, 1]. Select the mixture 
density, |, according to the subinterval in which <x falls. 

3. Decompose Uj; into QAQ’, where Q is the matrix of eigenvectors 
of U;, and A is the diagonal matrix of eigenvalues of Uj. 

4. Generate a K-dimensional normal deviate, y, of zero mean and 
covariance A. 

5. Set O, = Qy + py. 

6. Partition the unit interval proportionally to aj,, 1s k s N. 
Generate x, a random deviate uniform on [0, 1] and select the next 
state, 1, according to the subinterval in which x falls. 

7. Increment t. 

8. If t s T go to 2; else, stop. 

The Markov model generator was specified by a model M;,, and by 
a limit on the number of observations 7, or on the number of sequences 
Q (for the left-to-right case). Each individual sequence, in the left-to- 
right case, started in state g, (at observation 1) and terminated in 
state gn (at observation T), with the property that it had to have been 
in state qn for at least L observations. (Typically L was 5 to 10.) 
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Ill. EXPERIMENTAL EVALUATION OF THE REESTIMATION PROCEDURE 


A series of experimental evaluations of the reestimation procedure 
were made to determine the sensitivities of the algorithm—and hence 
the resulting HMMs—to aspects of the observation sequence used to 
train the model. Using the toy Markov model generators, several input 
source models were defined (i.e., the model parameters were specified) 
and: several sets of observation sequences were generated from the 
models. For each input source model, the reestimation algorithm was 
used to obtain locally optimal model parameters based on the gener- 
ated sequences and initial estimates. The resulting model M was 
compared with the source model using a probabilistic distance 
measure”? of the form 


logl-2(Om,, | Min)] — logl2(Om,, |M)] 


D(Min, M) — T. ’ 


(9) 
where Oy,, was a set of observations generated by the toy model Min, 
and T;,, was-the total number of observations in this set. The distance 
measure of eq. (9) gives the normalized difference in log likelihoods of 
the observation sequence coming from the true toy model, and of the 
likelihood of its coming from the estimated model, where the normal- 
ization is the number of observations in Oy,,. Previous experience 


with D has shown that this measure is very effective for comparing 
HMMs.” 


3.1 Correlation of model distance to changes in model parameters 


Before investigating the sensitivities of the reestimation procedure 
to various model parameters and initial conditions, a preliminary 
experiment was run to measure the correlation of model distance to 
changes in model parameters. For this experiment, the initial (ergodic) 
model had the specifications 


M=N=K=2 


_[o8 02 _ [0.75 0.25 
ane oe Wa 


T.- 'b: 5. 9. 
n= [b 5], n= [> 2 
_|4 2 _{7 3 _{10 2 
Ui. oa b 7 ’ Ua. a b 3 Oye. = b al 


Ux... = f 1 : 
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A new model was created in which all model parameters remained the 
same except for one set, in which the columns of the corresponding 
matrix or matrices were reversed, i.e., a; became a;;, Or cjm became Cn;, 
etc. In this manner we could study the effects of changing only a single 
parameter set on the model distance. A smooth interpolation between 
the parameter set for the initial model and the reversed parameter set 
was made by changing the parameter set in steps, and then measuring 
model distance at each step. In particular, if we denote the matrix in 
the initial model by X and the reversed matrix in the new model by 
X’, the intermediate matrices X” were formed by 


oS ke 


where the deviation factor 6 = «, 2¢, --- , 2’"e and e = 0.016. 
The results of this preliminary experiment are shown in Fig. 2, 


which gives a series of plots of model distance D, versus signal-to- 
noise ratio y, defined as 


(| X |}? 
|X -— X” |?’ 
where || - || denotes matrix norm (|| X ||? = ¥; 3; xi for X = [xj]) for 


changes in A, C, and yw. (Curves similar to that for » can be obtained 
for changes in U.) It can be seen that perturbed models are far more 


Y = 10 logio 


0.25 


O F VARIABLE 
@ A VARIABLE 
A C VARIABLE 


DISTANCE 





i¢) 30 40 
SIGNAL-TO-NOISE RATIO IN DECIBELS 


Fig. 2—Distance as a function of parameter deviation for changes in yp, A, and C. 
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distant (in the probabilistic distance sense) from the initial models 
when yp is perturbed than when the transition probability or the 
mixture gain parameters are perturbed. In general, the HMMs are 
indeed much more sensitive to small errors in u values than to small 
errors in C or A values unless the variances are extremely large. The 
exact sensitivity will depend on the precise relationships among means 
and the associated variances. 

The effects of the detailed relationships among means and the 
associated variances can be seen from the nonmonotonic behavior of 
the distance curve pertaining to yu in Fig. 2. In this particular case, the 
mean vectors moved from (1,5) to (5,1), (3,4) to (4,3), (5,9) to (9,5), 
and (8,2) to (2,8) as the signal-to-noise ratio decreased from about 
38.6 dB to 2.6 dB. Thus, as seen in Fig. 2, when y drops below 8 dB, 
the probabilistic distance for » deviations actually decreases. One may 
observe that in some section along the perturbation path from (8,2) to 
(2,8), the perturbed mean moves closer to the original mean locations 
(1,5) and (3,4), and thus results in a decrease rather than increase in 
the probabilistic distance. 

We should point out that if we arbitrarily increase the number of 
mixture components in modeling a given density, then, with proper 
choice of the initial estimate, the obtained mixture weights become 
proportional to sample values of the density function at the mean 
locations of the mixtures. When this happens, the variance in each 
mixture density, as well as the spacing of the means, decreases and, 
asymptotically, the observation density is mainly characterized by the 
mixture gains and the mean vectors. Thus, there is a continuum in 
the observation of relative model sensitivities as the number of mixture 
terms varies. 


3.2 Sensitivities of the reestimation procedure to parameter inaccuracies 
and to the training sequence 


Based on the results given in the previous section, a series of 
experiments were performed to investigate the sensitivities of the 
reestimation procedure to initial parameter estimates and to the length 
of the training sequence. 

The first experiment used a left-to-right source model with the 
characteristics 
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The initial guess of the model parameters was random for A and C, 
and identity matrices for U. For p, the initial guess had the form 


= (1 — 2-a)p, (10) 


where a is a random variable uniformly distributed on (0,2), and z is 
a user-specified error bound, which limits the maximum possible 
deviation of yn’ from p in the source model. 

The source generated Q random sequences according to the specified 
model, where Q varied from 10 to 100 (in steps of 10) and initial 
estimates with values of z = 0.0, 0.2, 0.4, and 0.6 were used. For each 
set of observations, and for each initial estimate, the reestimation 
procedure was iterated until a stationary point was found. At this 
point, both the average (negative) log likelihood for the estimated 
model M and the model distance (from the source to the estimated 
model) were calculated. Figure 3 shows a series of plots of the average 
(negative) log likelihood, and the model distance, as a function of the 
total number of observations in the Q training sequences, for the four 
values of z. For values of z = 0 and 0.2, for sufficiently long training 
sequences (i.e., 20 sets of observations or about 400 observations), the 
model distances were reasonably small (less than 0.25). As z got bigger, 
thereby making the initial estimates of yu poorer, the resulting models 
had distances on the order of 0.4 or larger. It is clearly shown that the 
accuracy of the estimated model depends on the initial estimate from 
which the iterative reestimation procedure starts. A converged model 
estimate is only a local optimum and, in general, has a lower likelihood 
than the global optimum. 

Figure 3 also shows the correlation between the log likelihood and 
the model distance. For sufficiently long observations, the model 
distance is a good predictor of the relative log likelihood for the 
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Fig. 3—(a) Distance and (b) average log likelihood as a function of the number of 
suey evions in the training sequence, and as a function of the initial estimation 
eviation, z. 


estimated model. When a smaller number of observations is generated 
for estimation, the statistical characteristics of the source become less 
well represented in the generated observation sequences, and hence, 
the estimated model is more data specific, resulting in greater varia- 
tions in the log likelihood. This can be more easily seen from the 
behavior of distance for a specific set of training sequences as the 
reestimation procedure iterates to a stable solution. Such a plot is 
given in Fig. 4 for Q = 20 sequences (part a) and for Q = 50 sequences 
(part b). (Note that these two curves show the distance behavior of 
the model reestimate as it converges to the solution corresponding to 
the two particular points, P,; and P2, in the upper curve of Fig. 3, 
respectively.) For Q = 20 sequences, the training set does not provide 
a good characterization of the source model—it is too short; hence the 
model distance decreases for a couple of iterations and then increases 
as the local estimated parameters are adjusted to match those of the 
specific observation sequence rather than those of the true generating 
model. For Q = 50 sequences, the distance between the estimated 
model and the true source model steadily decreases. 


3.3 Sensitivity of model reestimation to evaluation of the density function 


Because of the wide dynamic range of the density function, };(x), of 
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Fig. 4—(a) Distance for Q = 20 sequences training and (b) Q@ = 50 sequences training 
as a function of the iteration number. 


eq. (1)—especially when the estimates of » and U are in error—a 
minimum value clipping level, forrp, is usually required to avoid poten- 
tial underflow and singularity problems. In our study, whenever 5; (x) 
was less than 10~““r, it was artificially clamped at 10~c'r; otherwise 
b;(x) was kept as computed. This, in effect, injects certain noise 
components into the observations. To understand the effect of four 
on the resulting model estimates, a left-to-right, four-state, one-mix- 
ture, two-dimensional model was used, with the specification 


N = 4, M=1, K=2 
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The value of four was varied from 70—which is essentially full 
precision on the Data General MV8000 32-bit computer—down to 
10—severe clipping—and initial estimates of uw’ were generated as in 
eq. (10). For each value of four and z, the distance between the model 
resulting from the reestimation procedure and the original source 
model was computed, and the results are plotted in Fig. 5. For all runs, 
the number of observation sequences used in training was 50; hence 
there was an adequate number of observations for the parameter 
estimates to converge to the true model. The results given in Fig. 5 
show that for z = 0 the distance is insensitive to values of fc. over 
the entire range. For z = 0.2, the distance is much larger for four = 10 
than for all other values of fcr. The differences in distance between 
the results for z = 0.2 and those for z = 0.0 are insignificant except 
for those at fcu = 10. For z = 0.4, the model estimates yield larger 
distances than for z = 0 or z = 0.2 for all values of forrp. The differences 
for fcuy values of less than 50 are primarily due to the sensitivity of 
the reestimation procedure to the initial w estimates as discussed 
previously. The differences for fcirp in the range 10 to 40 are due to 
sensitivities of the reestimation procedure to the clipping itself. To 
understand this sensitivity, consider Fig. 6, which shows a Gaussian 
with a clipping threshold 10~/u”. In the case that initial estimates of 
p (and U) are very close to the true value, the density function will 
rarely, if ever, be clipped; hence until 10“? approaches the peak of 
the density function, the clipping has little effect on the model esti- 
mate. In the case where initial estimates of w are far from the true 
value, a large percentage of the density computations will be clipped 
and the reestimation procedure will be unable to improve the param- 
eter estimates because the density function is essentially flat in the 
region of the clipping. For such cases, very poor estimates of pw result 
and large model distances are obtained. The results point out an 
important consideration in practical implementations of the estima- 
tion algorithm, where finite precision is inevitable. 


3.4 Modeling correlated processes by mixtures of uncorrelated processes 


The mixture form of eq. (1) is a very versatile and flexible represen- 
tation of the pdf in each state. For example, a complicated multivariate 
pdf may be approximated by a mixture of Gaussian multivariate 
densities with full covariance matrices, or, by increasing the number 
of mixture components, a mixture of Gaussian multivariate densities 
with only diagonal covariance matrices (i.e., vector elements are un- 
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Fig. 5—Distance as a function of the density clipping threshold for z = 0, 0.2, and 
0.4. 


correlated). To better understand this concept, we studied the trade- 
off between the degree of correlation and the number of mixture 
components in the representation by modeling correlated multivariate 
densities with different numbers of uncorrelated multivariate densities 
using the HMM framework. 

The source model used for these studies had the following specifi- 
cations: 
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where p varied from 0 to 0.9 (in steps of 0.1). Thus, the source model 
had a full two-dimensional covariance matrix with correlation p be- 
tween components of each vector. 

We considered two separate HMMs for matching the observation 
sequences of the full covariance source model. The first model used 
an M = 1 (a single) mixture with a diagonal covariance matrix; the 
second model used an M = 5 mixture, where each component density 
again had a diagonal covariance matrix. Since we were interested only 
in the capabilities of the models—and not in the concomitant problems 
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Fig. 6—-Explanation of the sensitivity of the reestimation algorithm to the density 
clipping threshold for different values of z. 
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of reestimation—the initial estimates of the model parameters were 
selected to optimize the match. Thus for the M = 1 model, the initial 
estimates were identical to those of the source, except the off-diagonal 
covariance terms were set to 0. For the M = 5 model, the initial 
estimates of » and U were adjusted in order to best match the full 
covariance with correlation p by the M = 5 mixtures. The procedure 
used is illustrated in Fig. 7, which shows a K equals a two-dimensional 
correlation in the (x, x2) plane (part a), and a one-dimensional slice 
(part b). The initial estimates of » for the M = 5 case are shown by 
the center dots of the five circles in part a. The mixture gains and the 
mixture covariances were chosen to provide good initial fits to the 
correlated covariance as shown in both parts a and b. 

The results of estimating optimum models for the M = 1 and M = 
5 cases are shown in Fig. 8, which gives plots of model distance versus 
p. For M = 1, the model fits have distances less than or equal to 0.1 
only for p < 0.35, and have distances less than or equal to 0.2 for p S 
0.5. For M = 5, the model fits have distances less than or equal to 0.1 
for p = 0.7, and have distances less than or equal to 0.2 for p up to 0.9. 
Thus, for this case, the M = 5 mixture models without correlations 
provide excellent approximations to models with correlated random 
variables up to correlations of 0.9. 

The results presented above show that it is possible to model a K- 
dimensional (K = 2) correlated random process by a mixture of M- 
uncorrelated, K-dimensional, Gaussian random processes. The ques- 
tion that remains is why one would be interested in using such an 
approximation. There are two possible reasons that readily come to 
mind. First, there is the possibility that more reliable estimates can 
be made of the set of 2M-K means and variances for the M-mixture 
uncorrelated processes case, than for the set of K(K + 3)/2 means and 
correlations for the one-mixture correlated process. If this is the case 
the trade-off is between the increased error in the approximation 
process and the increased reliability in the estimation process. The 
second possible reason for using the M-mixture uncorrelated process 
instead of the one-mixture correlated process is the potential for a 
decrease in the number of parameters that need to be estimated. To 
see when this can occur, we define P, as the number of parameters for 
the one-mixture correlated density, and P, as the number of parame- 
ters in the M-mixture uncorrelated density case. Then, assuming K- 
dimensional vectors, we get 


_ KK + 3) 


iP; 5 


and 
P, = M(2K + 1). 
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For P, <= P, we require 

K(K + 3) 

2(2K + 1)" 

Thus, for K < 5, the largest M can be is 1, for 9 = K = 6, the largest 
M can be is 2, and for 13 = K = 10, the largest M can be is 3 to realize 


Ms 





Fig. 7—(a) Observation region in the (x, x2) plane for highly correlated vector 
components along with initial estimates of » for M = 5 model; (b) interpretation of 
initial estimates along a one-dimensional projection of the (x, x2) plane. 
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Fig. 8—Distance versus p for M = 1 and M = 5 mixture fits using diagonal covariance 
matrices. 


any reduction in the number of variables to be estimated. For speech- 
recognition applications, we generally use K ~ 10; hence values of 
M s 8 could be considered. Whether or not the model is adequately 
represented with this many diagonal mixtures depends heavily on the 
specific application. The purpose of the above discussion is to point 
out the possibilities of the alternative method. 


IV. DISCUSSION 
The results presented in the previous section have shown the follow- 


ing: 
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1. Continuous HMMs characterized by mixture densities are most 
sensitive to estimation errors in the locations of the means of each 
mixture density. If the error in the initial estimate of the mean becomes 
sufficiently large, then the reestimation procedure has very little 
chance of giving good model parameter estimates. 

2. The sensitivity of the models to errors in initial covariance 
estimates is less than that due to errors in the initial mean estimates. 

3. The sensitivity of the models to errors in either transition matrix 
coefficients, or mixture gains, is low. Hence, good model estimates can 
be obtained even with poor initial estimates of these parameters, as 
long as the distribution does not contain singularities. 

4. We have found that observations on the order of 500 to 1000 are 
adequate for models that are typical of many applications in speech 
processing (e.g., models with N = 10, K = 10, M = 3). 

5. Good initial parameter estimates become critical in the reesti- 
mation procedure when word precision for the evaluation of the density 
function is limited—an inevitable situation in practical implementa- 
tions. 

6. Mixture density models with diagonal covariance matrices for 
each mixture can be used to approximate full covariance models. 

The most important conclusion from our experiments is that it is 
absolutely mandatory to have a good initial guess of the means of the 
density functions to obtain good HMMs. With a good initial guess of 
the means, the parameter reestimation procedure is capable of yielding 
good models even if other model parameters have poor intial estimates. 


V. SUMMARY 


Several interesting properties of continuous density HMMs have 
been discussed. These include model sensitivity to initial parameter 
estimates, to evaluation of the density function, and to size and type 
of training sequence. We have shown how a mixture density of uncor- 
related variables can successfully represent a model with highly cor- 
related variables, as long as enough mixtures are used. The results 
presented here can be applied to a variety of real-world problems. 
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A variable-length packetization process is proposed for logarithmic Pulse 
Code Modulation (log-PCM)-encoded speech signals. Blocks of W log-PCM 
words are reduced in size by discarding words on the basis that the receiver 
can recover the speech with a signal-to-noise ratio (s/n) that is, in general, 
above a specified value. Specifically, blocks of two hundred fifty-six, 8-bit, 
p-law PCM words are left unchanged or reduced to either 214, 205, 192, 171, 
or 128 words. These blocks are then formulated into packets. The discarded 
samples at the dispatching terminal are replaced at the receiver by means of 
adaptive interpolation. We found that by specifying the s/n of the decoded 
speech in each variable-length packet to be above 27 dB, the reduction in the 
transmitted speech data was 25 percent, while the recovered speech signal had 
negligible perceptual degradation. 


I. INTRODUCTION 


Speech signals are inherently bursty. Indeed, conversational speech 
is composed of speech segments sandwiched between variable dura- 
tions of silences and interword pauses, as well as intraword gaps. Even 
during a speech utterance, the information rate is not constant and 
can exhibit surges. Stated simply, the statistical properties of speech 
are nonstationary, and our speech encoders are designed to exploit in 
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various degrees this nonstationarity. However, the majority of existing 
speech encoders’? generate data at a fixed rate, and as the information 
content of the speech source is varying, so is the recovered speech 
quality. If we stipulate that the speech quality should be maintained 
above a specified level, then the variation in the information content 
of the speech signal causes the bit rate to vary. Earlier attempts at 
variable-bit-rate encoding have usually employed a buffer to convert 
the variable bit rate generated by the encoder to a fixed rate for 
transmission over the channel. Such arrangements suffer from long 
delays in the recovered speech signal and are often susceptible to 
transmission errors. However, when the transmission medium can 
accommodate many channels, variable-rate digitized speech can be 
used with advantage. By converting this variable data rate into packets 
of varying length, we are able to reduce the overall data rate for a 
given number of speech channels. These packets are amenable to 
statistical multiplexing, which is essentially a generalization of digital- 
speech interpolation. By employing packet switching,* we can conven- 
iently combine statistical multiplexing with switching and routing 
procedures. Thus, the line of enquiry to be followed here is concerned 
with the formulation of packetized digital speech, and, in particular, 
variable-length packets that utilize the time-varying statistics of 
speech signals. 

It is interesting to observe that there are many speech-encoding 
algorithms that operate on blocks of speech samples.’* The resulting 
blocks of digital speech are eminently suited to packetization. All that 
is required to yield packets of digital speech is the addition of suitable 
headers and flags to the blocks. These additions enable the multiplexer 
and the switching and routing network to identify the beginning and 
termination of the packets, and give the receiver access to any decoding 
commands. Although the number of digital encoders conceived are 
legion, only a few types are employed in the networks. Foremost is 
logarithmic Pulse Code Modulation (log-PCM),’ which is formidably 
entrenched in the international networks. Adaptive differential pulse 
code modulation is now being introduced® on trunk networks, and 
adaptive delta modulation’ has been used in a limited way in subscriber 
loops and over satellite links. In this discourse we will focus on the 
variable-length packetization of log-PCM speech. 

In Section II we review the adaptive interpolation procedures to be 
used in our variable-length packetized p-law PCM-encoded speech 
system. The technique of determining the packet length is described 
in Section III. For each packet we ensure that the recovered speech 
quality at the receiver is, in general, above a minimum specified value. 
In Section IV we discuss our results. The final section deliberates on 
the properties and ramifications of the proposed packetization method. 
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H. APPLICATION OF ADAPTIVE INTERPOLATION TO REDUCE THE 
TRANSMITTED DATA RATE 


Starting with blocks of W speech samples, it is possible to reject 
certain samples at the transmitter with the knowledge that the receiver 
is able to replace the discarded samples with virtually an imperceptible 
degradation. The replacement technique® employs adaptive interpo- 
lation. Thus, we commence with a block of W speech samples, discard 
some samples prior to transmission, and at the receiver reinsert the 
missing samples to yield speech that is acceptable according to a 
simple objective criterion. For a given recovered speech quality crite- 
rion, the number of samples discarded in any block is dependent on 
the correlative properties of its W speech samples. When the samples 
are highly correlated more samples are rejected prior to transmission, 
and vice versa when the samples are relatively uncorrelated. This is 
because of the nature of the interpolation procedure. In general, the 
W samples in each block are reduced in number. We may, therefore, 
think of packets having W speech samples converted to packets 
containing fewer speech samples. The length of the packet, i.e., the 
number of samples in the packet, depends upon the ability to discard 
samples in the confident knowledge that the receiver will reproduce 
W samples with a speech quality that is above a specified value. 
However, in our deliberations we will not be concerned with variable- 
length packetization of samples, but with log-PCM speech data. This 
matter will be dealt with in Section III. 

Next we will briefly review the adaptive procedure used to replace 
the discarded samples. This procedure is cardinal in controlling the 
length of the transmitted packet. 

The method is fully described in Ref. 8 and only a brief review will 
be presented here. Essentially, speech is sampled at or above the 
Nyquist rate to give the sequence {x;}. Operating on a block containing 
W contiguous samples of this sequence, we reject every nth member 
in order to achieve a reduction in the speech data. The discarded 
samples are replaced by 


ie r=n,2n,...,W—n, W 
to give the recovered sequence, 
Xi, eeey Xn-1> Zn Xn+1) a} X2n-15 22n) Xon+1) eee Xw-12w. 


The interpolated samples are 


n~1 
2-= DY aiXrsis (1) 


i=—(n-1) 


where aq; are the interpolation coefficients, ay = 0. The set of interpo- 


SPEECH SIGNALS = 1273 
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R(-6,5) |A(-5,5) Al-4,5) Al-3,5) A(-25) A(-1,5) AUS) Al25) (3,5) Al45) R165) 





Fig. 1—Matrix A for n = 2, 3,..., 6. 


lation coefficients that minimizes the mean-square interpolation error 
€, = X — 2,5 r=n,2n,..., W, (2) 
is 
a= [ai-n; Qo-n, ++. ) a-1, Q1,..-- 9 Qn-25 Qn-1]* (3) 
and is found from 
a= AC, (4) 
where 
C = [R(0, 1 — n), R(0,2—n),..., R(O, —1), R(O, 1),..., 
R(0, n — 2), R,n—1)]’. (5) 
The superscripts —1 and T represent inverse and transpose operations, 
respectively. The matrix A for n = 2, 3, ..., 6 is shown in Fig. 1, 
where the innermost dotted enclosure, the next dotted enclosure, and 
so on, and the complete matrix refer to n = 2, 3, ..., 6, respectively. 
We will not perform adaptive interpolation with n > 6 for the reasons 
stated in Section 3.1. The elements in the matrix A, and in the vector 
C given by eq. (5), are 


Ww 
> Xr+kXrsj 


r=n 


W 
Dy Xrej 
rn 


R(k, J) 


k=+1,+2,...,4n-1 
Poth, 22)5..54n—1 


r=n,2n,..., W. (6) 


1274 TECHNICAL JOURNAL, JULY-AUGUST 1985 


Observe that 
€n7 65, pPH1, 22,8 K 1 (7) 


and, therefore, that the number of interpolation coefficients to be 
computed is twice that required when the simple correlation coefficient 
W-r 
>, XpXp+r 
r=1 
RG): = PSH Ay 2 ag We (8) 


DXF 
r=1 


is employed. The use of R(r) yields significantly greater interpolation 
errors than does the use of R(k, j).° 


Ili. VARYING n TO DEMAND A BLOCK SIGNAL-TO-NOISE RATIO 
ABOVE A SPECIFIED LEVEL 


In the previous subsection, we summarized (1) the procedure for 
removing every nth sample in a block of W speech samples and (2) 
the reinsertion of these discarded samples at the receiver by using 
adaptive interpolation based on the transmitted samples. Thus, 
the interpolated sample Z,, given by eq. (1), is the weighted sum of 
2(n — 1) neighboring speech samples that were transmitted. In the 
original paper® the value of n was a constant, although the effect of 
varying n was studied. When the signal-to-noise ratio (s/n), computed 
over a block of length W samples, was plotted as a function of block 
number during a sentence of speech, it exhibited huge variations. 

For example, for W = 256, n = 4, the s/n in a block could have been 
as small as 4 dB and as large as 58 dB, while the average of these 
block s/n’s, the segmental s/n, was 35 dB. This implies that when the 
block s/n was low, the speech was relatively uncorrelated and n should 
have been significantly in excess of 4. By contrast, a high-block s/n 
meant that the speech correlation was so high that more samples could 
have been discarded, and a value of n = 2 might have still yielded an 
acceptable block s/n. To attempt to guarantee a minimum block s/n, 
we must be prepared to vary n, even allowing for n to be infinity, i.e., 
no samples rejected, when the speech is very uncorrelated. The value 
of n clearly determines the number of samples in the transmitted 
block, being as large as W when n is infinity and as small as W/2 
when n is 2. As blocks are virtually synonymous with packets, we are 
led to the notion of variable-length packets. 


3.1 Generation of variable-length packets 


So far our discussion of adaptive interpolation has involved the 
processing of analog speech samples. We will not, however, be con- 
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cerned with the transmission of samples. Rather, we will describe the 
packetizing of u-law PCM-encoded speech. The packets are formulated 
at a transmitting terminal, gathering the encoded speech words as 
they are produced. (Alternatively, the packetization may operate at a 
node in the network, converting a 64-kb/s data stream into a sequence 
of variable-length packets and thereby reducing the overall data rate.) 

Figure 2 shows the system arrangement for the creation of variable- 
length packets when packets are created at the subscriber’s packet 
network terminal or interface. Figure 3 is the similar system on the 
receiver side. The input speech sequence is p-law PCM encoded, and 
the resulting words are stored in buffer 3. The y-law PCM-encoded 
speech is locally decoded to give quantized speech samples that are 
directed into buffer 1. While this is in progress the sequence of 
previously decoded samples {x;} is removed from buffer 2 for process- 
ing. Buffers 1 and 2 are of length W samples. Buffer 3 holds the 
contents of buffers 1 and 2, but in the yu-law format. Starting with 
n = 2, every other sample in {x,} is discarded to yield the sequence 
{z,}. The two interpolation coefficients a_; and a, are computed from 
{x,} using eq. (4). These coefficients must be encoded for transmission, 
so they are scaled by a factor K to ensure that the largest coefficient 
will not exceed the range of the quantizer. The quantized coefficients 
are binary encoded and, if selected for transmission, are conveyed to 
the multiplexer (MUX). The coefficients are also locally descaled by 
1/K to give the decoded coefficients a_;, a, that a receiver would have 
to use, assuming they would be regenerated without error. Operating 
on {z;} and {a,}, the discarded samples are reinserted using the adaptive 
interpolation procedure described in Section II. The locally recovered 
sequence {2,} and the corresponding block of input speech samples {%;} 
are then used to calculate the block s/n: 


x2 


Ms 


k=1 


(s/n)n = 10 logio| =-————— (9) 
2 (X — 2x)? 


This block s/n, (s/n),, is compared with a reference s/n, (s/7)ree, 
and if 


(s/n)n = S/Mret} n accepted, (10) 


then the value of n = 2 is accepted. Otherwise, the process is repeated 
using the next higher value of n, namely, 3. Should this later condition 
prevail, the coefficients G2, G_1, @,, G2, are calculated, and this time 
{z,} has every third sample missing. The new {z;} is formulated, the 
(s/n)3 computed, and inequality (10) tested. If this inequality is not 
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Fig. 2—System for generating variable-length packet speech. 
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satisfied, n is increased to 4 and another iteration is performed, and 
so on. Should n = 6 fail to satisfy inequality (10), the iteration ceases 
because the packet size reduction with n > 6 is insufficient to justify 
the increase in processing time and system complexity. In this situa- 
tion the entire block of u-law PCM speech data is transmitted. 

Observe that in computing (s/n), we formulate the overall error 
[see the denominator of eq. (9)], which includes the effects of both 
quantization and interpolation. Thus, when inequality (10) is satisfied 
the overall s/n for the block of W speech samples is guaranteed to be 
above the specified minimum value of s/nyer. However, we note that if 
S/Nret is sufficiently high there will be some blocks where the quanti- 
zation noise due to p-law PCM encoding prevents inequality (10) from 
being satisfied, even though no samples are discarded. In this situation 
we do no worse than the performance of the p-law PCM encoder, i.e., 
the s/n for the block is only dependent on the encoder noise, and the 
number of code words transmitted in the block remains unchanged 
at W. 

The packetization scheme shown in Fig. 2 causes the original block 
of W speech samples, namely, {x,}, to be processed to yield a data 
sequence where words have lengths of 


w (=); n=2,...,6 
Le n 


W; n= 0, 


(11) 


This y-law PCM data sequence is removed from buffer 3, with words 
discarded where appropriate, and conveyed to the multiplexer. These 
L p-law PCM words are the digitized-speech data that are formulated 


1278 TECHNICAL JOURNAL, JULY-AUGUST 1985 


into the packet at the multiplexer. To these data must be added the 
header, which consists of a 3-bit binary number for n, followed by an 
8-bit representation of each interpolation coefficient, and a system 
flag. The end of the packet is signified by another system flag. These 
flags are dependent on the switching arrangements for a particular 
network, and because of the myriad of possible switching systems, we 
will refrain from detailing the properties of the flags. The number of 
channel-protection bits for n and {a,} is dependent on the bit error 
rate specified for the network. Again we will not concern ourselves 
with the numerous network scenarios. We can state that (1) if n is 
represented by an 8-bit word—i.e., it is equivalent in length to one 
p-law PCM word—the combination of L data words and the part of 
the header needed by the receiver to decode the speech data is 


a ae n=2,...,6 


L+1; n=o (12) 


because 2n words are required for the transmission of the interpolation 
coefficients when every nth word is discarded; and (2) when n = 
only one extra word is required to inform the receiver of the situation. 
It is not necessary that the end of packet be conveyed, because a 
knowledge of n specifies the length of the packet. By employing an 
end-of-packet flag, we are allowing for errors in the reception of n. 


3.2 Decoding the variable-length packets 

The received packet enters a buffer. The flag at the front end of the 
packet causes the control system to remove the data word representing 
n. Armed with this information, the coefficient data and digital-speech 
data are extracted from the buffer and subsequently decoded. The 
sequences {a,} and {z,} are formed (see Fig. 3), and the samples that 
were rejected at the transmitter are reinserted by adaptive interpola- 
tion to yield {Z,}. Speech is decoded in successive packets such that 
although there are L elements in {z;}, where L is a function of n, there 
are always W samples in the recovered sequence {z,}. Thus, speech 
samples are presented to the final filter at the same rate at which they 
were originally generated. 


IV. RESULTS 


The sentences “Live wires should be kept covered,” spoken by a 
male, and “To reach the end he needs much courage,” enunciated by 
a female, were concatenated, bandlimited from 0.3 to 3.2 kHz, and 
sampled at 8 kHz to yield the input speech sequence. This sequence 
was 8-bit, u-law PCM encoded, » = 255, to provide the input digital- 
speech sequence. 
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Figure 2 shows the block diagram of the variable-length packetiza- 
tion system. The u-law PCM data stream was decoded, and the 
resulting samples were directed into either buffer 1 or 2, as described 
in Section 3.1. The selection of W was 256, a value that had been 
found to provide a reasonable recovered s/n for the range of n values 
considered here.® This choice of W also ensured that the amount of 
header information to convey the interpolation coefficients was insig- 
nificant compared with the number of data words in the packet and 
that the delay in the recovered speech signal at the receiving terminal 
due to the block size was not excessive. The duration of these blocks 
of 256 speech samples was 32 ms. 

The coefficients {a,} were quantized using, for convenience, an 8- 
bit, u-law quantizer, » = 255. The scaling factor K was determined as 
follows. The system of Fig. 2 was operated as described in Section 3.1, 
but without the interpolation coefficients {a,} being quantized. How- 
ever, the value of K for each block was noted as 


V 


| a.) | max i 


Kunax (13) 
where V was the maximum range of the quantizer, namely, 4079 
arbitrary units, and | a,.)|max was the coefficient with the maximum 
magnitude. Although we considered quantizing each value of K with 
an 8-bit word and transmitting it as part of the header, we opted in 
favor of a fixed value of K. Accordingly, we set k to be the maximum 
value of K, namely, 7969, observed in the 152 blocks of input data 
used in our experiments. With this fixed K we commenced our pack- 
etization properly, quantizing the interpolation coefficients and em- 
ploying them in our adaptive interpolation procedures. The same fixed 
K was used at the receiver. 

The value of n used in our experiments was either 2, 3, 4, 5, 6, or 
infinity, depending on the minimum s/n, s/nyer, stipulated. The block 
s/n for a particular value of n, namely, (s/n),, was computed using 
the input speech samples {x,} and {Z,}. The latter sequence contained 
the original speech samples contaminated by both quantization and 
interpolation noise. Thus, the (s/n), values employed in our simula- 
tions are indicative of the quality of the recovered speech. To visually 
emphasize, in Fig. 4, those occasions when n was infinity, we arranged 
for the block s/n to “hit the stops,” namely, an arbitrary number of 
50 dB. These large excursions in s/n dramatically underline those 
times when the system was unable to reject any samples and thereby 
effect a reduction in the transmitted data rate. In our calculations of 
segmental s/n? (shown in Fig. 5) we, of course, used the original input 
speech sequence and the recovered speech sequence at the output of 
the receiver. We note that those blocks for which n equaled infinity 
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had an s/n that was the signal-to-quantization noise ratio of the 
p-law PCM speech. 

Figures 4a, c, e, and g show the variation of block s/n, i.e., the 
(s/n), for the final value of n used in the iteration, as a function of 
block number when s/nyep was 15, 20, 25, and 30 dB, respectively. The 
corresponding profiles of the value of n used in the packetization 
process are displayed in Figs. 4b, d, f, and h, respectively. For s/rer of 
only 15 dB, n = 2 occurred most frequently, and only in one block was 
the system unable to reject any samples. By contrast, when s/nrep = 30 
dB, the guaranteed speech quality was so high that the most frequent 
decision was not to discard any speech samples in the block. This 
situation occurred because the block s/n of 8-bit, u-law PCM was 
often below 30 dB. 

Observe that for every 256 data words received, 128, 85, 64, 51, 42, 


SPEECH SIGNALS = 1281 


50 | 
30 


0 20 40 60 80 100 120 140 160 
BLOCK NUMBER 


BLOCK SIGNAL-TO-NOISE 
RATIO IN DECIBELS 


rel 


VALUE OF n 


Fig. 4—(c) and (d) Block s/n and n, respectively, for s/rre¢ of 20 dB. (Cont.) 


and 0 words were discarded for transmitted packets associated with n 
of 2, 3, 4, 5, 6, and infinity, respectively. Clearly, the lower the value 
of n, the greater the data reduction. However, the value of n selected 
for a given block of W speech samples was dependent on s/nres, such 
that the lower the s/n,ez, the more probable the occurrence of a low 
value of n. Thus, as expected, the savings in the transmitted bit rate 
were at the expense of speech quality. 

The histograms of n for different values of s/nre¢ are displayed in 
Fig. 6. When a low recovered speech quality is acceptable, such as that 
obtained with s/nrep = 15 dB, we found that n = 2 was the most 
frequently selected value of n and that n > 4 was rarely used. The 
reduction in the packetized data was found to be 41 percent. As the 
S/Nep Was increased, there was an increase in the frequency of occur- 
rence of the higher values of n and a diminution in the rate at which 
lower n values occurred. The reduction in the data due to variable- 
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Fig. 4—(e) and (f) Block s/n and n, respectively for s/nre¢ of 25 dB. (Cont.) 


length packetizing was found to be 36, 28, and 20 percent for segmental 
s/n of 22.5, 25.5, and 33 dB, respectively. These reductions are dis- 
played graphically in Fig. 5 as a function of our control parameter 
S/Nre- Categorizing speech quality from segmental s/n is always con- 
tentious. However, when we included our informal listening experi- 
ences, we achieved a close approximation to toll quality speech with a 
data reduction of some 25 percent. 

Figures 7a and b show a segment of the speech signal having a 
duration of 360 ms and its spectrogram. The corresponding error 
waveform and its spectrogram for 8-bit, uy-law PCM encoding are 
displayed in Figs. 7c and d. As expected with log-PCM, the quantiza- 
tion noise power was approximately proportional to the speech-signal 
power, and the error spectrum was relatively constant over the message 
band. When the variable-length packetization of the speech signal was 
performed, the error in the recovered speech signal was composed of 
the quantization noise and interpolation noise components. The error 
magnitudes were, therefore, increased as shown in Figs. 7e and f, where 
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Fig. 4—(g) and (h) Block s/n and n, respectively, for s/rrer of 30 dB. 


S/Nep = 25 dB. The spectral magnitudes in Figs. 7d and f have the 
same arbitrary units. 

Figure 8a shows variations of the error signal for u-law PCM over 
the entire speech signal, while the corresponding waveform for the 
recovered packetization speech is displayed in Fig. 8b. The waveforms 
of Figs. 8a and b are drawn to the same scale and provide a visual 
guide to the magnitude and location of the increase in error due to the 
packetization process. We observe that there is considerable corre- 
spondence between the variations in signal amplitudes in Figs. 8a and 
b. These variations also closely correspond to those in the original 
speech signal (not shown). This is to be expected, as the quantization 
noise is approximately proportional to the speech signal, and so is the 
interpolation noise because of the control exhibited by s/Mnreg. 


V. DISCUSSION 


A variable-length packetization scheme has been proposed for 8-bit, 
p-law PCM-encoded speech. By the aid of our control parameter 
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Fig. 5—Variation of segmental s/n, and percentage of speech data reduction, as a 
function of s/Mreg. 


reference s/n we obtained a reduction in encoded speech data of 25 
percent, while the segmental s/n exceeded 30 dB, i.e., the recovered 
speech displayed negligible perceptual impairments. An important 
feature of the system is that the s/n of the recovered speech in each 
packet is required to exceed an s/nyes. Accepting s/n as a crude guide 
to quality, particularly for the relatively high s/n values employed 
here, our system endeavors to generate packets of variable length such 
that when they are decoded the speech will be maintained above a 
certain quality. 

In the scheme proposed here the length of the packets varies. The 
number of u-law PCM words in the packets could be either 256, 214, 
205, 192, 171, or 128, where the original block of speech contained 256 
words. While these packets can be transmitted as variable length with 
suitable headers and flags, they can also be transmitted as fixed-length 
packets, where the discarded speech data can be replaced with other 
types of data, such as that arising from computer traffic. 

In our proposal the packet lengths are determined for a given speech 
signal by the selection of the s/nyes, which acts as a quality control 


SPEECH SIGNALS 1285 


60 


50 


S/N vet = 30 dB 
30 ? 


PERCENTAGE OF NUMBER OF TIMES n OCCURRED 


\ / ,- S/Nye¢ = 25 dB 


9 
/ 


~. 
x s/npeg = 20 dB 
O S/N et = 15 dB 
2 3 4 5 6 co 
VALUE OF n 
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mechanism. If the multiplexer is allowed to vary the s/nyez, it can 
determine the length of the packets required for multiplexing. Thus, 
as more packets access the highways, the multiplexer control can 
discard more p-law PCM words. Because the selection of those words 
to be discarded for a given s/nyer is related to the interpolation algo- 
rithm, the degradation in recovered speech quality with increased user 
capacity is relatively smooth. 

The packetization process need not be at the user’s interface or 
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Fig. 7—Speech waveforms and spectrograms. (a) and (b) Segment of the speech signal 
and its spectrogram. (Cont.) 


terminal; it can be located at a node in the network converting 
successive groups of W y-law PCM words into variable-length packets. 
Thus, time-division-multiplexed u-law PCM channels can be statisti- 
cally multiplexed using the packetization scheme and thereby achieve 
a significant diminution in the data rate. Alternatively, a limited- 
capacity output port from the network node can accommodate more 
u-law PCM traffic by employing the packetization procedure. 
However, packetizing u-law PCM at a node in the network requires 
a different s/n,-¢ compared with when the packetization occurs at the 
user’s interface. In Fig. 2 we observe that the block s/n, namely 
(s/n),, is computed using {2,} and {x,}. The packetization equipment 
at a node in the network would not have access to the original speech 
sequence {x,} and would have to employ {x;}. In this situation, when 
no samples are discarded in a block, i.e., n is infinity, (s/n), would 
also be infinity because the sequences {Z,} and {x,} are identical. Thus, 
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Fig. 7—(c) and (d) Corresponding error waveform and its spectrogram for 8-bit, 
p-law PCM-encoded speech, » = 255. (Cont.) 
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Fig. 7—(e) and (f) Corresponding error waveform and its spectrogram when the 
variable-length packetized speech system was employed. 
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Fig. 8—Error waveforms for the speech signal. (a) Eight-bit, n-law PCM speech. (b) 
Variable-length packetized speech. 


(s/n), computed at a node in the network is the signal-to-interpolation 
noise ratio, and for a given n this ratio is higher than that obtained 
for the arrangement shown in Fig. 2. Consequently, if s/nyre¢ is specified 
in terms of packetizing at the subscriber’s interface, a higher value of 
s/Nrep is required when the packetization occurs at a node in the 
network. The important point to note, however, is that s/nyer is a 
control parameter, and it is a simple procedure to increase it in order 
to achieve the required minimum block s/n performance, or a required 
average value of n. 

Finally, we note that controlling packet length with spectral distance 
measures, and other measures more closely related to perception, 
should reduce the average transmitted bit rate per channel, but at the 
expense of added complexity. 
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